Author Archives: gene_x

Functional Clustering of Genes Based on COG Terms Using Eggnog and Blast2GO

COG_relative_occurrence

  1. Generate protein fasta

    python3 getProteinSequences_1.py > Proximity_labelling.fasta
    python3 getProteinSequences_2.py > ALFA_pulldown.fasta
    python3 getProteinSequences_3.py > schnittmenge.fasta
  2. Using eggnog_env to generate the intial annotations

    mamba activate eggnog_env
    emapper.py -i Proximity_labelling.fasta -o eggnog_out_Proximity_labelling --cpu 60  #--resume
    emapper.py -i ALFA_pulldown.fasta -o eggnog_out_ALFA_pulldown --cpu 60
    emapper.py -i schnittmenge.fasta -o eggnog_out_schnittmenge --cpu 60
  3. Using BLAST2GO to generate GO terms and EC terms

  4. Using R to merge the two files blast2goannot.annot2 and eggnog_out.emapper.annotations.txt

    #mamba activate r_env
    # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
    # PREPARING cp eggnog_out.emapper.annotations eggnog_out.emapper.annotations.txt. Note that the first last 3 lines starting with "#" should be removed; '#query' to query
    
    # Load required libraries
    library(openxlsx)  # For Excel file handling
    library(dplyr)     # For data manipulation
    library(clusterProfiler)  # For KEGG and GO enrichment analysis
    #library(org.Hs.eg.db)  # Replace with appropriate organism database
    #BiocManager::install("GO.db")
    #BiocManager::install("AnnotationDbi")
    library(GO.db)
    library(AnnotationDbi)
    
    # Step 1: Load the blast2go annotation file with a check for missing columns
    annot_df <- read.table("/home/jhuang/b2gWorkspace_schnittmenge/blast2go_annot.annot2_",
                        header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)
    # If the structure is inconsistent, we can make sure there are exactly 3 columns:
    colnames(annot_df) <- c("GeneID", "Term")
    # Step 2: Filter and aggregate GO and EC terms as before
    go_terms <- annot_df %>%
    filter(grepl("^GO:", Term)) %>%
    group_by(GeneID) %>%
    summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
    ec_terms <- annot_df %>%
    filter(grepl("^EC:", Term)) %>%
    group_by(GeneID) %>%
    summarize(EC = paste(Term, collapse = ","), .groups = "drop")
    
    eggnog_data <- read.delim("~/DATA/Data_Michelle_ProteinClustering/eggnog_out_schnittmenge.emapper.annotations.txt", header = TRUE, sep = "\t")
    
    # Perform the left joins and rename columns
    eggnog_data_updated <- eggnog_data %>%
    left_join(go_terms, by = "GeneID") %>%
    left_join(ec_terms, by = "GeneID")
    #%>% select(-EC.x, -GOs.x)
    #%>% rename(EC = EC.y, GOs = GOs.y)
    
    # Create a new workbook
    wb <- createWorkbook()
    # Add the complete dataset as the first sheet (with annotations)
    addWorksheet(wb, "Complete_Data")
    writeData(wb, "Complete_Data", eggnog_data_updated)
    # Save the workbook to a file
    saveWorkbook(wb, "Gene_with_Annotations_schnittmenge.xlsx", overwrite = TRUE)
    
    #Manually DELETE GOs.x and EC.x and rename GOs.y to GOs, EC.y to EC.
  5. Draw group bar plot for relative occurrence

    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np
    
    # COG annotations for the two gene sets
    proximity_labelling = [
        'D', 'S', 'K', 'S', 'T', 'S', 'S', 'F', 'E', 'J', 'G', 'C', 'C', 'S', 'I', 'L', 'I', 'S', 'S', 'G', 'O', 'H', 'IQ', 'L', 'C', 'H', 'C', 'NU', 'O', 'C',
        'M', 'S', 'IQ', 'G', 'S', 'M', 'M', '-', 'M', 'S', 'J', 'S', 'IQ', 'K', 'E', 'J', 'K', 'E', 'J', 'S', 'K', 'E', 'S', 'J', 'S', 'V', 'C', 'I', 'K', 'IQ',
        'M', 'S', 'T', 'P', 'C', 'J', 'S', 'E', 'C', 'K', 'H', 'S', 'H', 'H'
    ]
    
    alfa_pulldown = [
        'K', 'S', 'P', 'K', 'C', 'K', 'I', 'O', '-', 'F', 'C', 'P', 'HQ', '-', 'L', 'L', 'S', 'G', 'IQ', 'S'
    ]
    
    # Define all possible COG categories (including the combinations)
    all_cog_categories_ = ['J','A', 'K', 'L', 'B', 'D', 'Y', 'V', 'T', 'M', 'N', 'Z', 'W', 'U', 'O', 'C','G', 'E', 'F', 'H', 'I', 'P', 'Q', 'R', 'S']
    all_cog_categories = all_cog_categories_[::-1]
    
    # Define functional groups for COG categories
    functional_groups = {
        "INFORMATION STORAGE AND PROCESSING": ['J', 'A', 'K', 'L', 'B'],
        "CELLULAR PROCESSES AND SIGNALING": ['D', 'Y', 'V', 'T', 'M', 'N', 'Z', 'W', 'U', 'O'],
        "METABOLISM": ['C', 'G', 'E', 'F', 'H', 'I', 'P', 'Q'],
        "POORLY CHARACTERIZED": ['R', 'S']
    }
    
    # Function to count occurrences of each COG category in a gene set, handling combinations
    def count_cog_categories(gene_set):
        counts = {cog: gene_set.count(cog) for cog in all_cog_categories}
    
        # Treat combinations (e.g., IQ, HQ, NU) as belonging to both respective categories
        combination_categories = {
            'IQ': ['I', 'Q'],
            'HQ': ['H', 'Q'],
            'NU': ['N', 'U']
        }
    
        for comb, categories in combination_categories.items():
            comb_count = gene_set.count(comb)
            for category in categories:
                counts[category] += comb_count
            counts[comb] = 0  # Remove the combination itself as it is distributed
    
        return counts
    
    # Count COG categories for each gene set
    proximity_counts = count_cog_categories(proximity_labelling)
    alfa_counts = count_cog_categories(alfa_pulldown)
    
    # Calculate relative occurrence (percentage) of each COG category for both gene sets
    total_proximity = len(proximity_labelling)
    total_alfa = len(alfa_pulldown)
    
    # Calculate percentages for each COG category in both gene sets
    proximity_percentage = {cog: count / total_proximity * 100 for cog, count in proximity_counts.items()}
    alfa_percentage = {cog: count / total_alfa * 100 for cog, count in alfa_counts.items()}
    
    # Create a DataFrame to store the percentages for both gene sets
    df_percentage = pd.DataFrame({
        'Proximity Labelling (%)': [proximity_percentage.get(cog, 0) for cog in all_cog_categories],
        'ALFA Pulldown (%)': [alfa_percentage.get(cog, 0) for cog in all_cog_categories]
    }, index=all_cog_categories)
    
    ## Reverse the order of the COG categories
    #df_percentage = df_percentage[::-1]
    # Round the values to 1 decimal place
    df_percentage = df_percentage.round(1)
    
    # Save the relative occurrence data in a text file
    file_path = "COG_relative_occurrence.txt"
    df_percentage.to_csv(file_path, sep='\t', header=True, index=True)
    
    # Save the relative occurrence data in an Excel file
    file_path = "COG_relative_occurrence.xlsx"
    df_percentage.to_excel(file_path, header=True, index=True)
    
    # Display file path where the data was saved
    print(f"Relative occurrence data has been saved to {file_path}")
    
    # --------------------
    # Horizontal Grouped Bar Plot
    # --------------------
    
    # Plot a horizontal grouped bar chart
    fig, ax = plt.subplots(figsize=(12, 8))
    
    # Create the positions for each group of bars
    bar_height = 0.35
    index = np.arange(len(df_percentage))
    
    # Plot bars for both gene sets
    bar1 = ax.barh(index, df_percentage['Proximity Labelling (%)'], bar_height, label='Proximity Labelling', color='skyblue')
    bar2 = ax.barh(index + bar_height, df_percentage['ALFA Pulldown (%)'], bar_height, label='ALFA Pulldown', color='lightgreen')
    
    # Customize the plot
    ax.set_ylabel('COG Categories')
    ax.set_xlabel('Relative Occurrence (%)')
    ax.set_title('Relative COG Category Distribution')
    ax.set_yticks(index + bar_height / 2)
    ax.set_yticklabels(all_cog_categories)
    
    # Annotate functional groups
    #, fontweight='bold'
    for group, categories in functional_groups.items():
        # Find indices of the categories in this group
        group_indices = [all_cog_categories.index(cog) for cog in categories]
        # Add a label for the functional group
        ax.annotate(group, xy=(1, group_indices[0] + 0.5), xycoords='data', fontsize=9, color='black')
    
    # Add legend
    ax.legend()
    
    # Show the plot
    plt.tight_layout()
    plt.show()

Beitragsbemessungsgrenzen und Rechengrößen in der Sozialversicherung 2023

https://www.krankenkassen.de/gesetzliche-krankenkassen/system-gesetzliche-krankenversicherung/sozialversicherung-rechengroessen-beitragsbemessungsgrenze-versicherungspflichtgrenze/rechengroessen-2023/?utm_source=chatgpt.com

Beitragsfreie Familienversicherung

Eine beitragsfreie Familienversicherung ist nur möglich, wenn das Einkommen des betroffenen Ehepartner 485 Euro nicht übersteigt (ein siebtel der Bezugsgröße).

Bezugsgröße in der Sozialversicherung

Das vorläufige Durchschnittsentgelt für das Jahr 2021 in der Rentenversicherung beträgt 40.463 Euro. Das vorläufige Durchschnittsentgelt für das Jahr 2022 in der Rentenversicherung beträgt 43.142 Euro. Die Bezugsgröße in der Sozialversicherung 2023 steigt auf 3.395 Euro monatlich (West) und auf 3290 Euro monatlich (Ost) .

Mindestbemessungsgrundlage für Selbstständige

Aus der Bezugsgröße läßt sich die Mindestbemessungsgröße für Selbstständige mit niedrigem Einkommen ableiten. Die Formel: Mindesbemessungsgrundlage=(Bezugsgröße/90)x30.

Für das Jahr 2023 ergibt sich damit als Mindestbemessungsgrundlage für Selbstständige (3395/90)x30=1131,67 Euro.

Beitragsberechnung bei privat versichertem Ehegatten

Höchstgrenze des zu berücksichtigenden Einkommens des privat versicherten Ehepartners oder Lebenspartners bei der Beitragsberechnung des gesetzlich Versicherten (Hälfte der Beitragsbemessungsgrenze): 2493,75 Euro.

Beitragsfreie Familienversicherung

Im Jahr 2023 lag die Einkommensgrenze für die beitragsfreie Familienversicherung in der gesetzlichen Krankenversicherung bei 485 Euro monatlich. krankenkassen.de Diese Grenze entspricht einem Siebtel der monatlichen Bezugsgröße in der Sozialversicherung. Für Angehörige, die eine geringfügig entlohnte Beschäftigung (Minijob) ausüben, lag die Einkommensgrenze bei 505 Euro monatlich. Bitte beachten Sie, dass diese Werte für das Jahr 2023 gelten und sich in den folgenden Jahren ändern können. Für aktuelle Informationen empfehle ich, die offiziellen Veröffentlichungen der Deutschen Rentenversicherung oder der Deutschen Gesetzlichen Unfallversicherung zu konsultieren.

https://www.sbk.org/mitglied-werden/familie/?mtm_source=google&mtm_medium=cpc&mtm_content=mitglied-werden&mtm_kwd=krankenkasse%20familie&mtm_campaign=familie&gad_source=1&gclid=CjwKCAiAzba9BhBhEiwA7glbatV6hNpQV36WpQqFMXk02AmWazMUrT8JMstKq3UA2Ky0GRuVy9_RTxoCx8YQAvD_BwE

Ein Beitrag, alles drin – für die ganze Familie

Mit der Familienversicherung der SBK sind Ihre Angehörigen in der Krankenversicherung und in der Pflegeversicherung beitragsfrei mitversichert. Und genießen genauso wie Sie alle SBK-Vorteile. Dies gilt in der Regel für alle Familienmitglieder, deren regelmäßiges monatliches Einkommen nicht über 535 € (2025) liegt.

KEGG and GO annotations in non-model organisms

https://www.biobam.com/functional-analysis/

Blast2GO_workflow

Venn_Diagram_Tam_2025_1

  1. Assign KEGG and GO Terms (see diagram above)

    Since your organism is non-model, standard R databases (org.Hs.eg.db, etc.) won’t work. You’ll need to manually retrieve KEGG and GO annotations.

    Option 1 (KEGG Terms): EggNog based on orthology and phylogenies

    EggNOG-mapper assigns both KEGG Orthology (KO) IDs and GO terms.
    
    Install EggNOG-mapper:
    
        mamba create -n eggnog_env python=3.8 eggnog-mapper -c conda-forge -c bioconda  #eggnog-mapper_2.1.12
        mamba activate eggnog_env
    
    Run annotation:
    
        #diamond makedb --in eggnog6.prots.faa -d eggnog_proteins.dmnd
        mkdir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        download_eggnog_data.py --dbname eggnog.db -y --data_dir /home/jhuang/mambaforge/envs/eggnog_env/lib/python3.8/site-packages/data/
        #NOT_WORKING: emapper.py -i CP059040_gene.fasta -o eggnog_dmnd_out --cpu 60 -m diamond[hmmer,mmseqs] --dmnd_db /home/jhuang/REFs/eggnog_data/data/eggnog_proteins.dmnd
        python ~/Scripts/update_fasta_header.py CP059040_protein_.fasta CP059040_protein.fasta
        emapper.py -i CP059040_protein.fasta -o eggnog_out --cpu 60 --resume
        #----> result annotations.tsv: Contains KEGG, GO, and other functional annotations.
        #---->  470.IX87_14445:
            * 470 likely refers to the organism or strain (e.g., Acinetobacter baumannii ATCC 19606 or another related strain).
            * IX87_14445 would refer to a specific gene or protein within that genome.
    
    Extract KEGG KO IDs from annotations.emapper.annotations.

    Option 2 (GO Terms from ‘Blast2GO 5 Basic’, saved in blast2go_annot.annot): Using Blast/Diamond + Blast2GO_GUI based on sequence alignment + GO mapping

    • ‘Load protein sequences’ (Tags: NONE, generated columns: Nr, SeqName) –>
    • Buttons ‘blast’ (Tags: BLASTED, generated columns: Description, Length, #Hits, e-Value, sim mean),
    • Button ‘mapping’ (Tags: MAPPED, generated columns: #GO, GO IDs, GO Names), “Mapping finished – Please proceed now to annotation.”
    • Button ‘annot’ (Tags: ANNOTATED, generated columns: Enzyme Codes, Enzyme Names), “Annotation finished.”

      • Used parameter ‘Annotation CutOff’: The Blast2GO Annotation Rule seeks to find the most specific GO annotations with a certain level of reliability. An annotation score is calculated for each candidate GO which is composed by the sequence similarity of the Blast Hit, the evidence code of the source GO and the position of the particular GO in the Gene Ontology hierarchy. This annotation score cutoff select the most specific GO term for a given GO branch which lies above this value.
      • Used parameter ‘GO Weight’ is a value which is added to Annotation Score of a more general/abstract Gene Ontology term for each of its more specific, original source GO terms. In this case, more general GO terms which summarise many original source terms (those ones directly associated to the Blast Hits) will have a higher Annotation Score.

      or blast2go_cli_v1.5.1 (NOT_USED)

      #https://help.biobam.com/space/BCD/2250407989/Installation
      #see ~/Scripts/blast2go_pipeline.sh

    Option 3 (GO Terms from ‘Blast2GO 5 Basic’, saved in blast2go_annot.annot2): Interpro based protein families / domains –> Button interpro

    • Button ‘interpro’ (Tags: INTERPRO, generated columns: InterPro IDs, InterPro GO IDs, InterPro GO Names) –> “InterProScan Finished – You can now merge the obtained GO Annotations.”

    MERGE the results of InterPro GO IDs (Option 3) to GO IDs (Option 2) and generate final GO IDs

    • Button ‘interpro’/’Merge InterProScan GOs to Annotation’ –> “Merge (add and validate) all GO terms retrieved via InterProScan to the already existing GO annotation.” –> “Finished merging GO terms from InterPro with annotations. Maybe you want to run ANNEX (Annotation Augmentation).” #* Button ‘annot’/’ANNEX’ –> “ANNEX finished. Maybe you want to do the next step: Enzyme Code Mapping.”

      #– before merging (blast2go_annot.annot) — #H0N29_18790 GO:0004842 ankyrin repeat domain-containing protein #H0N29_18790 GO:0085020 #– after merging (blast2go_annot.annot2) –> #H0N29_18790 GO:0031436 ankyrin repeat domain-containing protein #H0N29_18790 GO:0070531 #H0N29_18790 GO:0004842 #H0N29_18790 GO:0005515 #H0N29_18790 GO:0085020

    Option 4 (NOT_USED): RFAM for non-colding RNA

    Option 5 (NOT_USED): PSORTb for subcellular localizations

    Option 6 (NOT_USED): KAAS (KEGG Automatic Annotation Server)

    • Go to KAAS
    • Upload your FASTA file.
    • Select an appropriate gene set.
    • Download the KO assignments.
  2. Find the Closest KEGG Organism Code (NOT_USED)

    Since your species isn’t directly in KEGG, use a closely related organism.

    • Check available KEGG organisms:

      library(clusterProfiler)
      library(KEGGREST)
      
      kegg_organisms <- keggList("organism")
      
      Pick the closest relative (e.g., zebrafish "dre" for fish, Arabidopsis "ath" for plants).
      
      # Search for Acinetobacter in the list
      grep("Acinetobacter", kegg_organisms, ignore.case = TRUE, value = TRUE)
      # Gammaproteobacteria
      #Extract KO IDs from the eggnog results for  "Acinetobacter baumannii strain ATCC 19606"
  3. Find the Closest KEGG Organism for a Non-Model Species

    If your organism is not in KEGG, search for the closest relative:

        grep("fish", kegg_organisms, ignore.case = TRUE, value = TRUE)  # Example search

    For KEGG pathway enrichment in non-model species, use “ko” instead of a species code (the code has been intergrated in the point 4):

        kegg_enrich <- enrichKEGG(gene = gene_list, organism = "ko")  # "ko" = KEGG Orthology
  4. Perform KEGG and GO Enrichment in R (under dir ~/DATA/Data_Tam_RNAseq_2024/results/star_salmon/degenes)

        #BiocManager::install("GO.db")
        #BiocManager::install("AnnotationDbi")
    
        # Load required libraries
        library(openxlsx)  # For Excel file handling
        library(dplyr)     # For data manipulation
        library(tidyr)
        library(stringr)
        library(clusterProfiler)  # For KEGG and GO enrichment analysis
        #library(org.Hs.eg.db)  # Replace with appropriate organism database
        library(GO.db)
        library(AnnotationDbi)
    
        # PREPARING go_terms and ec_terms: annot_* file: cut -f1-2 -d$'\t' blast2go_annot.annot2 > blast2go_annot.annot2_
        # Step 1: Load the blast2go annotation file with a check for missing columns
        annot_df <- read.table("/home/jhuang/b2gWorkspace_Tam_RNAseq_2024/blast2go_annot.annot2_",
                            header = FALSE, sep = "\t", stringsAsFactors = FALSE, fill = TRUE)
    
        # If the structure is inconsistent, we can make sure there are exactly 3 columns:
        colnames(annot_df) <- c("GeneID", "Term")
        # Step 2: Filter and aggregate GO and EC terms as before
        go_terms <- annot_df %>%
        filter(grepl("^GO:", Term)) %>%
        group_by(GeneID) %>%
        summarize(GOs = paste(Term, collapse = ","), .groups = "drop")
        ec_terms <- annot_df %>%
        filter(grepl("^EC:", Term)) %>%
        group_by(GeneID) %>%
        summarize(EC = paste(Term, collapse = ","), .groups = "drop")
    
        # Load the results
        #res <- read.csv("AUM_vs_MHB-all.csv")     #up149, down65
        #res <- read.csv("Urine_vs_AUM-all.csv")    #up155, down105
        res <- read.csv("Urine_vs_MHB-all.csv")   #up259, down138
    
        # Replace empty GeneName with modified GeneID
        res$GeneName <- ifelse(
            res$GeneName == "" | is.na(res$GeneName),
            gsub("gene-", "", res$GeneID),
            res$GeneName
        )
    
        # Remove duplicated genes by selecting the gene with the smallest padj
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
    
        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()
    
        res <- as.data.frame(res)
        # Sort res first by padj (ascending) and then by log2FoldChange (descending)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        # Read eggnog annotations
        eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2024/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        # Remove the "gene-" prefix from GeneID in res to match eggnog 'query' format
        res$GeneID <- gsub("gene-", "", res$GeneID)
        # Merge eggnog data with res based on GeneID
        res <- res %>%
        left_join(eggnog_data, by = c("GeneID" = "query"))
    
        # DEBUG: NOT_NECESSARY, since res has already GeneName
        ##Convert row names to a new column 'GeneName' in res
        #res_with_geneName <- res %>%
        #mutate(GeneName = rownames(res)) %>%
        #as.data.frame()  # Ensure that it's a regular data frame without row names
        ## View the result
        #head(res_with_geneName)
    
        # Merge with the res dataframe
        # Perform the left joins and rename columns
        res_updated <- res %>%
        left_join(go_terms, by = "GeneID") %>%
        left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)
    
        # DEBUG: NOT_NECESSARY, since 'GeneName' is already the first column.
        ## Reorder columns to move 'GeneName' as the first column in res_updated
        #res_updated <- res_updated %>%
        #select(GeneName, everything())
    
        ## Count the number of rows in the KEGG_ko, GOs, EC columns that have non-missing values
        #num_non_missing_KEGG_ko <- sum(res_updated$KEGG_ko != "-" & !is.na(res_updated$KEGG_ko))
        #print(num_non_missing_KEGG_ko)
        ##[1] 2030
        #num_non_missing_GOs <- sum(res_updated$GOs != "-" & !is.na(res_updated$GOs))
        #print(num_non_missing_GOs)
        ##[1] 2865 --> 2875
        #num_non_missing_EC <- sum(res_updated$EC != "-" & !is.na(res_updated$EC))
        #print(num_non_missing_EC)
        ##[1] 1701
    
        # Filter up-regulated genes
        up_regulated <- res_updated[res_updated$log2FoldChange > 2 & res_updated$padj < 0.01, ]
        # Filter down-regulated genes
        down_regulated <- res_updated[res_updated$log2FoldChange < -2 & res_updated$padj < 0.01, ]
    
        # Create a new workbook
        wb <- createWorkbook()
        # Add the complete dataset as the first sheet (with annotations)
        addWorksheet(wb, "Complete_Data")
        writeData(wb, "Complete_Data", res_updated)
        # Add the up-regulated genes as the second sheet (with annotations)
        addWorksheet(wb, "Up_Regulated")
        writeData(wb, "Up_Regulated", up_regulated)
        # Add the down-regulated genes as the third sheet (with annotations)
        addWorksheet(wb, "Down_Regulated")
        writeData(wb, "Down_Regulated", down_regulated)
        # Save the workbook to a file
        saveWorkbook(wb, "Gene_Expression_with_Annotations_Urine_vs_MHB.xlsx", overwrite = TRUE)
    
        # Set GeneName as row names after the join
        rownames(res_updated) <- res_updated$GeneName
        res_updated <- res_updated %>% dplyr::select(-GeneName)
        ## Set the 'GeneName' column as row.names
        #rownames(res_updated) <- res_updated$GeneName
        ## Drop the 'GeneName' column since it's now the row names
        #res_updated$GeneName <- NULL
    
        # ---- Perform KEGG enrichment analysis (up_regulated) ----
        gene_list_kegg_up <- up_regulated$KEGG_ko
        gene_list_kegg_up <- gsub("ko:", "", gene_list_kegg_up)
        kegg_enrichment_up <- enrichKEGG(gene = gene_list_kegg_up, organism = 'ko')
        # -- convert the GeneID (Kxxxxxx) to the true GeneID --
        # Step 0: Create KEGG to GeneID mapping
        kegg_to_geneid_up <- up_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
        # Step 1: Clean KEGG_ko values (separate multiple KEGG IDs)
        kegg_to_geneid_clean <- kegg_to_geneid_up %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove any duplicate mappings
        # Step 2.1: Expand geneID column in kegg_enrichment_up
        expanded_kegg <- kegg_enrichment_up %>%
        as.data.frame() %>%
        separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
        left_join(kegg_to_geneid_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Explicitly handle many-to-many
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
        #dplyr::glimpse(expanded_kegg)
        # Step 3.1: Replace geneID column in the original dataframe
        kegg_enrichment_up_df <- as.data.frame(kegg_enrichment_up)
        # Remove old geneID column and merge new one
        kegg_enrichment_up_df <- kegg_enrichment_up_df %>%
        dplyr::select(-geneID) %>%  # Remove old geneID column
        left_join(expanded_kegg %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
        dplyr::rename(geneID = GeneID)  # Rename column back to geneID
    
        # ---- Perform KEGG enrichment analysis (down_regulated) ----
        # Step 1: Extract KEGG KO terms from down-regulated genes
        gene_list_kegg_down <- down_regulated$KEGG_ko
        gene_list_kegg_down <- gsub("ko:", "", gene_list_kegg_down)
        # Step 2: Perform KEGG enrichment analysis
        kegg_enrichment_down <- enrichKEGG(gene = gene_list_kegg_down, organism = 'ko')
        # --- Convert KEGG gene IDs (Kxxxxxx) to actual GeneIDs ---
        # Step 3: Create KEGG to GeneID mapping from down_regulated dataset
        kegg_to_geneid_down <- down_regulated %>%
        dplyr::select(KEGG_ko, GeneID) %>%
        filter(!is.na(KEGG_ko)) %>%  # Remove missing KEGG KO entries
        mutate(KEGG_ko = str_remove(KEGG_ko, "ko:"))  # Remove 'ko:' prefix if present
        # Step 4: Clean KEGG_ko values (handle multiple KEGG IDs)
        kegg_to_geneid_down_clean <- kegg_to_geneid_down %>%
        mutate(KEGG_ko = str_remove_all(KEGG_ko, "ko:")) %>%  # Remove 'ko:' prefixes
        separate_rows(KEGG_ko, sep = ",") %>%  # Ensure each KEGG ID is on its own row
        filter(KEGG_ko != "-") %>%  # Remove invalid KEGG IDs ("-")
        distinct()  # Remove duplicate mappings
        # Step 5: Expand geneID column in kegg_enrichment_down
        expanded_kegg_down <- kegg_enrichment_down %>%
        as.data.frame() %>%
        separate_rows(geneID, sep = "/") %>%  # Split multiple KEGG IDs (Kxxxxx)
        left_join(kegg_to_geneid_down_clean, by = c("geneID" = "KEGG_ko"), relationship = "many-to-many") %>%  # Handle many-to-many mappings
        distinct() %>%  # Remove duplicate matches
        group_by(ID) %>%
        summarise(across(everything(), ~ paste(unique(na.omit(.)), collapse = "/")), .groups = "drop")  # Re-collapse results
        # Step 6: Replace geneID column in the original kegg_enrichment_down dataframe
        kegg_enrichment_down_df <- as.data.frame(kegg_enrichment_down) %>%
        dplyr::select(-geneID) %>%  # Remove old geneID column
        left_join(expanded_kegg_down %>% dplyr::select(ID, GeneID), by = "ID") %>%  # Merge new GeneID column
        dplyr::rename(geneID = GeneID)  # Rename column back to geneID
        # View the updated dataframe
        head(kegg_enrichment_down_df)
    
        # Create a new workbook
        wb <- createWorkbook()
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Up")
        writeData(wb, "KEGG_Enrichment_Up", as.data.frame(kegg_enrichment_up_df))
        # Save enrichment results to the workbook
        addWorksheet(wb, "KEGG_Enrichment_Down")
        writeData(wb, "KEGG_Enrichment_Down", as.data.frame(kegg_enrichment_down_df))
    
        # ---- Perform GO enrichment analysis (TODO: extract the merged GO IDs from 'Blast2GO 5 Basic' and adapt the code below!)----
        #http://yulab-smu.top/biomedical-knowledge-mining-book/clusterprofiler-go.html
        #If a user has GO annotation data (in a data.frame format with the first column as gene ID and the second column as GO ID), they can use the enricher() and GSEA() functions to perform an over-representation test and gene set enrichment analysis.
        ## Perform GO enrichment analysis (Using clusterProfiler does not work, use enricher instead.)
        #go_enrichment_up <- enrichGO(gene = gene_list_go_up, OrgDb, ont = "BP")  #keyType = "ENSEMBL", organism = 'abn',
        #To adapt your code for GO enrichment analysis using the up-regulated genes (149 genes) as the test set and all genes in res (3646 genes) as the background, we need to:
        # * Extract the GeneIDs from up_regulated and use them as gene_list (the test set).
        # * Extract the GeneIDs from res and use them as background_genes (the universe).
        # * Create a TERM2GENE dataframe by extracting GO terms from res.
        # * Run the enricher() function using these updated inputs.
    
        # Define gene list (up-regulated genes)
        gene_list_go_up <- up_regulated$GeneID  # Extract the 149 up-regulated genes
        gene_list_go_down <- down_regulated$GeneID  # Extract the 65 down-regulated genes
    
        # Define background gene set (all genes in res)
        background_genes <- res_updated$GeneID  # Extract the 3646 background genes
    
        # Prepare GO annotation data from res
        go_annotation <- res_updated[, c("GOs","GeneID")]  # Extract relevant columns
        go_annotation <- go_annotation %>%
        tidyr::separate_rows(GOs, sep = ",")  # Split multiple GO terms into separate rows
    
        # Perform GO enrichment analysis, where pAdjustMethod is one of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"
        #Here’s a ranking from least restrictive to most restrictive:
        #    * none – No correction at all.
        #    * fdr (same as BH, Benjamini-Hochberg) – Controls the false discovery rate rather than the family-wise error rate (FWER), allowing more significant results.
        #    * BY (Benjamini-Yekutieli) – Similar to BH but more conservative, accounting for dependence in test statistics.
        #    * Holm – A step-down version of Bonferroni, less strict than Bonferroni but still controls FWER.
        #    * Hochberg – Slightly more powerful than Holm, but still controlling FWER.
        #    * Hommel – A step-up approach controlling FWER, similar to Hochberg but slightly more powerful in some cases.
        #    * Bonferroni – The most conservative method, dividing the p-value threshold by the number of comparisons, leading to very strict significance criteria.
        #So, if you want the least restrictive correction while still controlling for multiple testing, use "fdr" (or "BH"). If you need a more stringent control over Type I errors, methods like Bonferroni or Holm are preferable.
    
        go_enrichment_up <- enricher(
            gene = gene_list_go_up,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 1.0,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_up <- as.data.frame(go_enrichment_up)
    
        go_enrichment_down <- enricher(
            gene = gene_list_go_down,                # Up-regulated genes
            TERM2GENE = go_annotation,       # Custom GO annotation
            pvalueCutoff = 1.0,             # Significance threshold
            pAdjustMethod = "BH",
            universe = background_genes      # Define the background gene set
        )
        go_enrichment_down <- as.data.frame(go_enrichment_down)
    
        ## Remove the 'p.adjust' column since no adjusted methods have been applied!
        #go_enrichment_up <- go_enrichment_up[, !names(go_enrichment_up) %in% "p.adjust"]
        # Update the Description column with the term descriptions
        go_enrichment_up$Description <- sapply(go_enrichment_up$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })
    
        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })
        ## Print the updated data frame
        #print(go_enrichment_up)
    
        ## Remove the 'p.adjust' column since no adjusted methods have been applied!
        #go_enrichment_down <- go_enrichment_down[, !names(go_enrichment_down) %in% "p.adjust"]
        # Update the Description column with the term descriptions
        go_enrichment_down$Description <- sapply(go_enrichment_down$ID, function(go_id) {
        # Using select to get the term description
        term <- tryCatch({
            AnnotationDbi::select(GO.db, keys = go_id, columns = "TERM", keytype = "GOID")
        }, error = function(e) {
            message(paste("Error for GO term:", go_id))  # Print which GO ID caused the error
            return(data.frame(TERM = NA))  # In case of error, return NA
        })
    
        if (nrow(term) > 0) {
            return(term$TERM)
        } else {
            return(NA)  # If no description found, return NA
        }
        })
    
        addWorksheet(wb, "GO_Enrichment_Up")
        writeData(wb, "GO_Enrichment_Up", as.data.frame(go_enrichment_up))
    
        addWorksheet(wb, "GO_Enrichment_Down")
        writeData(wb, "GO_Enrichment_Down", as.data.frame(go_enrichment_down))
    
        # Save the workbook with enrichment results
        saveWorkbook(wb, "KEGG_and_GO_Enrichments_Urine_vs_MHB.xlsx", overwrite = TRUE)
    
        #Error for GO term: GO:0006807: replace GO:0006807  obsolete nitrogen compound metabolic process
        #TODO: marked the color as yellow if the p.adjusted <= 0.05 in GO_enrichment!
  5. Draw the Venn diagram to compare the total DEGs across AUM, Urine, and MHB, irrespective of up- or down-regulation.

        # 安装 & 加载必要的库
        if (!requireNamespace("ggvenn", quietly = TRUE)) install.packages("ggvenn")
        if (!requireNamespace("openxlsx", quietly = TRUE)) install.packages("openxlsx")
        #install.packages("ggvenn")
        #install.packages("VennDiagram")
        #install.packages("ggVennDiagram")
    
        library(ggvenn)
        library(openxlsx)
        library(VennDiagram)
    
        res <- read.csv("AUM_vs_MHB-all.csv")     #up149, down65
        res$GeneName <- ifelse(
            res$GeneName == "" | is.na(res$GeneName),
            gsub("gene-", "", res$GeneID),
            res$GeneName
        )
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()
        res <- as.data.frame(res)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2024/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        res$GeneID <- gsub("gene-", "", res$GeneID)
        res <- res %>%
        left_join(eggnog_data, by = c("GeneID" = "query"))
        res_updated <- res %>%
        left_join(go_terms, by = "GeneID") %>%
        left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)
        up_regulated_AUM_vs_MHB <- res_updated[res_updated$log2FoldChange > 2 & res_updated$padj < 0.01, ]
        down_regulated_AUM_vs_MHB <- res_updated[res_updated$log2FoldChange < -2 & res_updated$padj < 0.01, ]
    
        res <- read.csv("Urine_vs_AUM-all.csv")    #up155, down105
        res$GeneName <- ifelse(
            res$GeneName == "" | is.na(res$GeneName),
            gsub("gene-", "", res$GeneID),
            res$GeneName
        )
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()
        res <- as.data.frame(res)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2024/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        res$GeneID <- gsub("gene-", "", res$GeneID)
        res <- res %>%
        left_join(eggnog_data, by = c("GeneID" = "query"))
        res_updated <- res %>%
        left_join(go_terms, by = "GeneID") %>%
        left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)
        up_regulated_Urine_vs_AUM <- res_updated[res_updated$log2FoldChange > 2 & res_updated$padj < 0.01, ]
        down_regulated_Urine_vs_AUM <- res_updated[res_updated$log2FoldChange < -2 & res_updated$padj < 0.01, ]
    
        res <- read.csv("Urine_vs_MHB-all.csv")   #up259, down138
        res$GeneName <- ifelse(
            res$GeneName == "" | is.na(res$GeneName),
            gsub("gene-", "", res$GeneID),
            res$GeneName
        )
        duplicated_genes <- res[duplicated(res$GeneName), "GeneName"]
        res <- res %>%
        group_by(GeneName) %>%
        slice_min(padj, with_ties = FALSE) %>%
        ungroup()
        res <- as.data.frame(res)
        res <- res[order(res$padj, -res$log2FoldChange), ]
        eggnog_data <- read.delim("~/DATA/Data_Tam_RNAseq_2024/eggnog_out.emapper.annotations.txt", header = TRUE, sep = "\t")
        res$GeneID <- gsub("gene-", "", res$GeneID)
        res <- res %>%
        left_join(eggnog_data, by = c("GeneID" = "query"))
        res_updated <- res %>%
        left_join(go_terms, by = "GeneID") %>%
        left_join(ec_terms, by = "GeneID") %>% dplyr::select(-EC.x, -GOs.x) %>% dplyr::rename(EC = EC.y, GOs = GOs.y)
        up_regulated_Urine_vs_MHB <- res_updated[res_updated$log2FoldChange > 2 & res_updated$padj < 0.01, ]
        down_regulated_Urine_vs_MHB <- res_updated[res_updated$log2FoldChange < -2 & res_updated$padj < 0.01, ]
    
        # 1. 生成基因集
        AUM_DEGs <- unique(c(up_regulated_AUM_vs_MHB$GeneID, down_regulated_AUM_vs_MHB$GeneID))
        Urine_DEGs <- unique(c(up_regulated_Urine_vs_AUM$GeneID, down_regulated_Urine_vs_AUM$GeneID))
        MHB_DEGs <- unique(c(up_regulated_Urine_vs_MHB$GeneID, down_regulated_Urine_vs_MHB$GeneID))
    
        # 2. 生成 Venn 图
        venn_plot <- ggvenn(
        list(AUM = AUM_DEGs, Urine = Urine_DEGs, MHB = MHB_DEGs),
        fill_color = c("blue", "red", "green"),  # 颜色设定
        stroke_size = 0.5,  # 线条粗细
        set_name_size = 5  # 组名字体大小
        )
    
        # 3. 保存 Venn 图为 PNG
        ggsave("Venn_Diagram.png", plot = venn_plot, width = 6, height = 6, dpi = 300)
    
        # 4. 计算各个集合的交集
        AUM_Urine <- intersect(AUM_DEGs, Urine_DEGs)
        AUM_MHB <- intersect(AUM_DEGs, MHB_DEGs)
        Urine_MHB <- intersect(Urine_DEGs, MHB_DEGs)
        AUM_Urine_MHB <- Reduce(intersect, list(AUM_DEGs, Urine_DEGs, MHB_DEGs))
    
        # 5. 保存基因列表到 Excel 文件
        #venn_gene_list <- list(
        #  "AUM_DEGs" = AUM_DEGs,
        #  "Urine_DEGs" = Urine_DEGs,
        #  "MHB_DEGs" = MHB_DEGs,
        #  "AUM_Urine" = AUM_Urine,
        #  "AUM_MHB" = AUM_MHB,
        #  "Urine_MHB" = Urine_MHB,
        #  "AUM_Urine_MHB" = AUM_Urine_MHB
        #)
        #write.xlsx(venn_gene_list, file = "Venn_GeneLists.xlsx", rowNames = FALSE)
    
        # 创建 Excel 文件,存储各个 Venn 区域的基因
        venn_list <- list(
        "AUM_only" = setdiff(AUM_DEGs, c(Urine_DEGs, MHB_DEGs)),
        "Urine_only" = setdiff(Urine_DEGs, c(AUM_DEGs, MHB_DEGs)),
        "MHB_only" = setdiff(MHB_DEGs, c(AUM_DEGs, Urine_DEGs)),
        "AUM_Urine" = intersect(AUM_DEGs, Urine_DEGs),
        "AUM_MHB" = intersect(AUM_DEGs, MHB_DEGs),
        "Urine_MHB" = intersect(Urine_DEGs, MHB_DEGs),
        "All_three" = Reduce(intersect, list(AUM_DEGs, Urine_DEGs, MHB_DEGs))
        )
        # 写入 Excel
        write.xlsx(venn_list, file = "Venn_Diagram_Genes.xlsx")
    
        # 打印完成信息
        print("Venn diagram saved as 'Venn_Diagram.png' and gene lists saved in 'Venn_Diagram_Genes.xlsx'.")

Ute RNA-Seq Data Methods

PCA_WaGa.png

PCA_MKL1.png

WaGa_wt.EV_vs_parental.png

MKL-1_wt.EV_vs_parental.png

DEGs_Heatmap_WaGa.png

DEGs_Heatmap_MKL-1.png

rseqc_read_distribution_plot_WaGa.png

rseqc_read_distribution_plot_MKL-1.png

PCA plot untreated/wt vs parental cells; 1x für WaGa cell line und 1x für MKL-1 cells

Heatmap untreated/wt vs parental; 1x für WaGa cell line und 1x für MKL-1 cells

Volcano plot untreated/wt vs parental; 1x für WaGa cell line und 1x für MKL-1 cells

Distribution of different RNA Species untreated/wt and parental; 1x für WaGa cell line und 1x für MKL-1 cells

RNA fragmentation patterns: is EV-RNA full length or fragmented?

RNA binding protein motifs: do we find specific motifs in EV-RNA?

miRNA target analysis

  1. Input files (use R 4.3.3)

        ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/merged_gene_counts_40samples.txt
        ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/merged_gene_counts_virus_rounded.txt
  2. Common processing for the data MKL+1 + WaGa

        #[1] "/home/jhuang/mambaforge/envs/r_env/lib/R/library"
    
        setwd("/home/jhuang/DATA/Data_Ute/Data_RNA-Seq_MKL-1_WaGa/results_2025_1/")
    
        library("AnnotationDbi")
        library("clusterProfiler")
        library("ReactomePA")
        #library("org.Mm.eg.db")
        library(DESeq2)
        library(gplots)
    
        d.raw_human<- read.delim2("../merged_gene_counts_40samples.txt",sep="\t", header=TRUE, row.names=1)
        colnames(d.raw_human)<- c("gene_name","X042_MKL.1_wt_EV","MKL.1_RNA_147","X042_MKL.1_sT_DMSO","MKL.1_RNA","X0505_MKL.1_scr_DMSO_EV","X0505_MKL.1_sT_DMSO_EV","MKL.1_EV.RNA_87","MKL.1_EV.RNA_27","X042_MKL.1_scr_Dox_EV","X042_MKL.1_scr_DMSO_EV","MKL.1_EV.RNA_118","X0505_MKL.1_scr_Dox_EV","MKL.1_EV.RNA","X042_MKL.1_sT_Dox","X0505_MKL.1_sT_Dox_EV","MKL.1_RNA_118","MKL.1_EV.RNA_2",      "Geneid.1","gene_name.1","X1605_WaGa_sT_DMSO_EV","WaGa_EV.RNA_118","WaGa_EV.RNA","X2706_WaGa_scr_Dox_EV","X2706_WaGa_sT_DMSO_EV","X1107_WaGa_wt_EV","X1107_WaGa_sT_Dox_EV","WaGa_RNA","X1605_WaGa_scr_DMSO_EV","X2706_WaGa_scr_DMSO_EV","WaGa_EV.RNA_226","X2706_WaGa_sT_Dox_EV","X1605_WaGa_scr_Dox_EV","X1605_WaGa_wt_EV","X1605_WaGa_sT_Dox_EV","X1107_WaGa_scr_DMSO_EV","WaGa_EV.RNA_2","WaGa_EV.RNA_147","WaGa_RNA_118","X1107_WaGa_scr_Dox_EV","X1107_WaGa_sT_DMSO_EV","WaGa_RNA_147","X2706_WaGa_wt_EV")
    
        col_order <- c("gene_name",  "MKL.1_RNA","MKL.1_RNA_118","MKL.1_RNA_147","MKL.1_EV.RNA","MKL.1_EV.RNA_2","MKL.1_EV.RNA_118","MKL.1_EV.RNA_87","MKL.1_EV.RNA_27","X042_MKL.1_wt_EV","X042_MKL.1_sT_DMSO","X0505_MKL.1_sT_DMSO_EV","X042_MKL.1_scr_DMSO_EV","X0505_MKL.1_scr_DMSO_EV","X042_MKL.1_sT_Dox","X0505_MKL.1_sT_Dox_EV","X042_MKL.1_scr_Dox_EV","X0505_MKL.1_scr_Dox_EV",     "Geneid.1","gene_name.1",    "WaGa_RNA","WaGa_RNA_118","WaGa_RNA_147",    "WaGa_EV.RNA","WaGa_EV.RNA_2","WaGa_EV.RNA_118","WaGa_EV.RNA_147","WaGa_EV.RNA_226","X1107_WaGa_wt_EV","X1605_WaGa_wt_EV","X2706_WaGa_wt_EV",    "X1107_WaGa_sT_DMSO_EV","X1605_WaGa_sT_DMSO_EV","X2706_WaGa_sT_DMSO_EV",    "X1107_WaGa_scr_DMSO_EV","X1605_WaGa_scr_DMSO_EV","X2706_WaGa_scr_DMSO_EV",     "X1107_WaGa_sT_Dox_EV","X1605_WaGa_sT_Dox_EV","X2706_WaGa_sT_Dox_EV",     "X1107_WaGa_scr_Dox_EV","X1605_WaGa_scr_Dox_EV","X2706_WaGa_scr_Dox_EV")
        reordered.raw_human <- d.raw_human[,col_order]
    
        d.raw_virus <- read.delim2("../merged_gene_counts_virus_rounded.txt",sep="\t", header=TRUE, row.names=1)
        reordered.raw_virus <- d.raw_virus[,col_order]
    
        identical(colnames(reordered.raw_human), colnames(reordered.raw_virus))
    
        reordered.raw <- rbind(reordered.raw_human, reordered.raw_virus)
    
        #rename
        colnames(reordered.raw) <- c("gene_name", "MKL-1 parental cell RNA","MKL-1 parental cell RNA 118","MKL-1 parental cell RNA 147",    "MKL-1 wt EV RNA","MKL-1 wt EV RNA 2","MKL-1 wt EV RNA 118","MKL-1 wt EV RNA 87","MKL-1 wt EV RNA 27","MKL-1 wt EV RNA 042",    "MKL-1 sT DMSO EV RNA 042","MKL-1 sT DMSO EV RNA 0505",  "MKL-1 scr DMSO EV RNA 042","MKL-1 scr DMSO EV RNA 0505",  "MKL-1 sT Dox EV RNA 042","MKL-1 sT Dox EV RNA 0505",  "MKL-1 scr Dox EV RNA 042","MKL-1 scr Dox EV RNA 0505",   "Geneid.1","gene_name.1",    "WaGa parental cell RNA","WaGa parental cell RNA 118","WaGa parental cell RNA 147",    "WaGa wt EV RNA","WaGa wt EV RNA 2","WaGa wt EV RNA 118","WaGa wt EV RNA 147","WaGa wt EV RNA 226","WaGa wt EV RNA 1107","WaGa wt EV RNA 1605","WaGa wt EV RNA 2706",    "WaGa sT DMSO EV RNA 1107","WaGa sT DMSO EV RNA 1605","WaGa sT DMSO EV RNA 2706",     "WaGa scr DMSO EV RNA 1107","WaGa scr DMSO EV RNA 1605","WaGa scr DMSO EV RNA 2706",     "WaGa sT Dox EV RNA 1107","WaGa sT Dox EV RNA 1605","WaGa sT Dox EV RNA 2706",     "WaGa scr Dox EV RNA 1107","WaGa scr Dox EV RNA 1605","WaGa scr Dox EV RNA 2706")
    
        reordered.raw$gene_name <- NULL
        reordered.raw$Geneid.1 <- NULL
        reordered.raw$gene_name.1 <- NULL
        write.csv(reordered.raw, file="counts.txt")
        #IMPORTANT that we should filter the data with the counts in the STEP!
        d <- reordered.raw[rowSums(reordered.raw>3)>2,]
    
        condition_for_pca = as.factor(c("RNA","RNA","RNA","EV","EV","EV","EV","EV","EV","sT.DMSO","sT.DMSO","scr.DMSO","scr.DMSO","sT.Dox","sT.Dox","scr.Dox","scr.Dox",    "RNA","RNA","RNA","EV","EV","EV","EV","EV","EV","EV","EV","sT.DMSO","sT.DMSO","sT.DMSO","scr.DMSO","scr.DMSO","scr.DMSO","sT.Dox","sT.Dox","sT.Dox","scr.Dox","scr.Dox","scr.Dox"))
    
        condition = as.factor(c("MKL1.RNA","MKL1.RNA","MKL1.RNA","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.sT.DMSO","MKL1.sT.DMSO","MKL1.scr.DMSO","MKL1.scr.DMSO","MKL1.sT.Dox","MKL1.sT.Dox","MKL1.scr.Dox","MKL1.scr.Dox",    "WaGa.RNA","WaGa.RNA","WaGa.RNA","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.sT.DMSO","WaGa.sT.DMSO","WaGa.sT.DMSO","WaGa.scr.DMSO","WaGa.scr.DMSO","WaGa.scr.DMSO","WaGa.sT.Dox","WaGa.sT.Dox","WaGa.sT.Dox","WaGa.scr.Dox","WaGa.scr.Dox","WaGa.scr.Dox"))
    
        donor = as.factor(c("1","118","147",  "1","2","118","87","27","042",  "042","0505","042","0505","042","0505","042","0505",    "1","118","147",  "1","2","118","147","226","1107","1605","2706",   "1107","1605","2706","1107","1605","2706","1107","1605","2706","1107","1605","2706"))
    
        batch = as.factor(c("2021.08","2021.09","2021.09","2021.08","2021.08","2021.09","2021.09","2021.09","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08",    "2021.08","2021.09","2021.09","2021.08","2021.08","2021.09","2021.09","2021.09","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11"))
    
        cell.line = as.factor(c("MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1",    "WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa"))
    
        ids = as.factor(c("MKL1.RNA","MKL1.RNA.118","MKL1.RNA.147",  "MKL1.EV","MKL1.EV.2","MKL1.EV.118","MKL1.EV.87","MKL1.EV.27","MKL1.EV.042",  "MKL1.EV.sT.DMSO.042","MKL1.EV.sT.DMSO.0505",  "MKL1.EV.scr.DMSO.042","MKL1.EV.scr.DMSO.0505",  "MKL1.EV.sT.Dox.042","MKL1.EV.sT.Dox.0505",  "MKL1.EV.scr.Dox.042","MKL1.EV.scr.Dox.0505",      "WaGa.RNA","WaGa.RNA.118","WaGa.RNA.147",  "WaGa.EV","WaGa.EV.2","WaGa.EV.118","WaGa.EV.147","WaGa.EV.226","WaGa.EV.1107","WaGa.EV.1605","WaGa.EV.2706",  "WaGa.EV.sT.DMSO.1107","WaGa.EV.sT.DMSO.1605","WaGa.EV.sT.DMSO.2706",  "WaGa.EV.scr.DMSO.1107","WaGa.EV.scr.DMSO.1605","WaGa.EV.scr.DMSO.2706",  "WaGa.EV.sT.Dox.1107","WaGa.EV.sT.Dox.1605","WaGa.EV.sT.Dox.2706",  "WaGa.EV.scr.Dox.1107","WaGa.EV.scr.Dox.1605","WaGa.EV.scr.Dox.2706"))
    
        cData = data.frame(row.names=colnames(d), condition=condition,  donor=donor, batch=batch, cell.line=cell.line, ids=ids)
        dds<-DESeqDataSetFromMatrix(countData=d, colData=cData, design=~batch+condition)
    
        #rld <- rlogTransformation(dds)
        rld <- vst(dds)
  3. Preparing the data for PCA_MKL1 and PCA_WaGa drawing

        #d_WaGa <- d[, !grepl("parental|MKL-1", names(d))]
        d_WaGa <- d[, !grepl("MKL-1", names(d))]
        condition = as.factor(c("WaGa.RNA","WaGa.RNA","WaGa.RNA","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.EV","WaGa.sT.DMSO","WaGa.sT.DMSO","WaGa.sT.DMSO","WaGa.scr.DMSO","WaGa.scr.DMSO","WaGa.scr.DMSO","WaGa.sT.Dox","WaGa.sT.Dox","WaGa.sT.Dox","WaGa.scr.Dox","WaGa.scr.Dox","WaGa.scr.Dox"))
        donor = as.factor(c("1","118","147","1","2","118","147","226","1107","1605","2706",   "1107","1605","2706","1107","1605","2706","1107","1605","2706","1107","1605","2706"))
        batch = as.factor(c("2021.08","2021.09","2021.09","2021.08","2021.08","2021.09","2021.09","2021.09","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11","2022.11"))
        cell.line = as.factor(c("WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa","WaGa"))
        ids = as.factor(c("WaGa.RNA","WaGa.RNA.118","WaGa.RNA.147","WaGa.EV","WaGa.EV.2","WaGa.EV.118","WaGa.EV.147","WaGa.EV.226","WaGa.EV.1107","WaGa.EV.1605","WaGa.EV.2706",  "WaGa.EV.sT.DMSO.1107","WaGa.EV.sT.DMSO.1605","WaGa.EV.sT.DMSO.2706",  "WaGa.EV.scr.DMSO.1107","WaGa.EV.scr.DMSO.1605","WaGa.EV.scr.DMSO.2706",  "WaGa.EV.sT.Dox.1107","WaGa.EV.sT.Dox.1605","WaGa.EV.sT.Dox.2706",  "WaGa.EV.scr.Dox.1107","WaGa.EV.scr.Dox.1605","WaGa.EV.scr.Dox.2706"))
        cData = data.frame(row.names=colnames(d_WaGa), condition=condition,  donor=donor, batch=batch, cell.line=cell.line, ids=ids)
        dds_WaGa<-DESeqDataSetFromMatrix(countData=d_WaGa, colData=cData, design=~batch+condition)
        rld_WaGa <- vst(dds_WaGa)
    
        #d_MKL1 <- d[, !grepl("parental|WaGa", names(d))]
        d_MKL1 <- d[, !grepl("WaGa", names(d))]
        condition = as.factor(c("MKL1.RNA","MKL1.RNA","MKL1.RNA","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.EV","MKL1.sT.DMSO","MKL1.sT.DMSO","MKL1.scr.DMSO","MKL1.scr.DMSO","MKL1.sT.Dox","MKL1.sT.Dox","MKL1.scr.Dox","MKL1.scr.Dox"))
        donor = as.factor(c("1","118","147","1","2","118","87","27","042",  "042","0505","042","0505","042","0505","042","0505"))
        batch = as.factor(c("2021.08","2021.09","2021.09","2021.08","2021.08","2021.09","2021.09","2021.09","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08","2022.08"))
        cell.line = as.factor(c("MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1","MKL-1"))
        ids = as.factor(c("MKL1.RNA","MKL1.RNA.118","MKL1.RNA.147", "MKL1.EV","MKL1.EV.2","MKL1.EV.118","MKL1.EV.87","MKL1.EV.27","MKL1.EV.042",  "MKL1.EV.sT.DMSO.042","MKL1.EV.sT.DMSO.0505",  "MKL1.EV.scr.DMSO.042","MKL1.EV.scr.DMSO.0505",  "MKL1.EV.sT.Dox.042","MKL1.EV.sT.Dox.0505",  "MKL1.EV.scr.Dox.042","MKL1.EV.scr.Dox.0505"))
        cData = data.frame(row.names=colnames(d_MKL1), condition=condition,  donor=donor, batch=batch, cell.line=cell.line, ids=ids)
        dds_MKL1<-DESeqDataSetFromMatrix(countData=d_MKL1, colData=cData, design=~batch+condition)
        rld_MKL1 <- vst(dds_MKL1)
    
        # -- before pca --
        png("pca.png", 1200, 800)
        plotPCA(rld, intgroup=c("condition"))
        #plotPCA(rld, intgroup = c("condition", "batch"))
        #plotPCA(rld, intgroup = c("condition", "ids"))
        #plotPCA(rld, "batch")
        dev.off()
        #72% (PC1), 11% (PC2)   8% (PC3)
    
        png("pca_WaGa.png", 1200, 800)
        plotPCA(rld_WaGa, intgroup = c("condition"))
        dev.off()
        #80%, 9%
    
        png("pca_MKL1.png", 1200, 800)
        plotPCA(rld_MKL1, intgroup = c("condition"))
        dev.off()
        #77%, 15%
  4. PCA plot untreated/wt vs parental cells; 1x für WaGa cell line und 1x für MKL-1 cells

        #install.packages("BiocManager")
        #BiocManager::install("genefilter")
        #install.packages("writexl")
        #install.packages("plotly")
        #install.packages("dplyr")
        #install.packages("ggrepel")
        # Load required libraries
        library(ggrepel)
        library(dplyr)
        library(ggplot2)
        library(genefilter)
        library(writexl)
    
        # ---- drawing PCA WaGa ----
    
        # Perform PCA on WaGa dataset
        pca_data <- plotPCA(rld_WaGa, intgroup = c("condition", "batch", "cell.line"), returnData = TRUE)
        write.csv(pca_data, file = "PCA_data_WaGa.csv")
    
        # Compute top variable genes and perform PCA
        top_variable_genes <- 500
        gene_variance <- rowVars(assay(rld_WaGa))
        selected_genes <- order(gene_variance, decreasing = TRUE)[seq_len(min(top_variable_genes, length(gene_variance)))]
    
        # Extract data and run PCA
        expression_matrix <- t(assay(rld_WaGa)[selected_genes, ])
        pca_results <- prcomp(expression_matrix, center = TRUE, scale. = TRUE)
        summary(pca_results)  # Display variance explained by PCs
    
        # Extract variance percentages
        variance_explained <- summary(pca_results)$importance[2, ]
        pc1_variance <- sprintf("%.1f", variance_explained[1] * 100)
        pc2_variance <- sprintf("%.1f", variance_explained[2] * 100)
    
        # Create a PCA dataframe
        pca_df <- as.data.frame(pca_results$x)
        rownames(pca_df) <- rownames(pca_data)
    
        # Merge PCA results with original metadata
        pca_data <- pca_data %>%
        select(-PC1, -PC2)  # Remove old PCA values
    
        merged_pca_data <- merge(pca_data, pca_df, by = "row.names")
        rownames(merged_pca_data) <- merged_pca_data$Row.names
        merged_pca_data$Row.names <- NULL
    
        # Save merged PCA data
        write.csv(merged_pca_data, file = "Merged_PCA_WaGa.csv")
        write_xlsx(merged_pca_data, path = "Merged_PCA_WaGa.xlsx")
    
        # Define condition labels
        condition_labels <- c(
              "WaGa.EV" = "wt EV RNA",
              "WaGa.scr.DMSO" = "scr DMSO EV RNA",
              "WaGa.scr.Dox" = "scr Dox EV RNA",
              "WaGa.sT.DMSO" = "sT DMSO EV RNA",
              "WaGa.sT.Dox" = "sT Dox EV RNA",
              "WaGa.RNA" = "parental cell RNA"
        )
    
        # Apply condition labels and filter relevant conditions
        filtered_pca_data <- merged_pca_data %>%
        mutate(condition = recode(condition, !!!condition_labels)) %>%
        filter(condition %in% c("wt EV RNA", "parental cell RNA"))
    
        # Prepare PCA data for plotting
        plot_pca_df <- as.data.frame(filtered_pca_data[, c("PC1", "PC2")])
        plot_pca_df$condition <- filtered_pca_data$condition
        plot_pca_df$sample_name <- rownames(filtered_pca_data)
    
        # Save PCA plot as PNG
        png("PCA_WaGa.png", width = 1000, height = 600, res = 150)
    
        ggplot(plot_pca_df, aes(x = PC1, y = PC2, color = condition)) +
        geom_point(size = 4, alpha = 0.8) +  # Plot points
        labs(
        title = "",
        x = paste0("Principal Component 1 (", pc1_variance, "%)"),
        y = paste0("Principal Component 2 (", pc2_variance, "%)")
        ) +
        theme_minimal() +
        scale_color_manual(values = c("blue", "red"))  # Customize colors
    
        dev.off()  # Close the PNG device and save the file
    
        # ---- drawing PCA MKL-1 ----
    
        # Compute PCA on MKL1 dataset
        pca_data <- plotPCA(rld_MKL1, intgroup = c("condition", "batch", "cell.line"), returnData = TRUE)
        write.csv(pca_data, file = "PCA_data_MKL1.csv")
    
        # Compute top variable genes and perform PCA
        top_variable_genes <- 500
        gene_variance <- rowVars(assay(rld_MKL1))
        selected_genes <- order(gene_variance, decreasing = TRUE)[seq_len(min(top_variable_genes, length(gene_variance)))]
    
        # Extract and transpose data for PCA
        expression_matrix <- t(assay(rld_MKL1)[selected_genes, ])
        pca_results <- prcomp(expression_matrix, center = TRUE, scale. = TRUE)
    
        # Extract variance explained percentages
        variance_explained <- summary(pca_results)$importance[2, ]
        pc1_variance <- sprintf("%.1f", variance_explained[1] * 100)
        pc2_variance <- sprintf("%.1f", variance_explained[2] * 100)
    
        # Create a PCA dataframe
        pca_df <- as.data.frame(pca_results$x)
        rownames(pca_df) <- rownames(pca_data)
    
        # Remove previous PC1 & PC2 columns and merge new PCA results
        pca_data <- pca_data %>%
        select(-PC1, -PC2)
    
        merged_pca_data <- merge(pca_data, pca_df, by = "row.names")
        rownames(merged_pca_data) <- merged_pca_data$Row.names
        merged_pca_data$Row.names <- NULL
    
        # Save merged PCA data
        write.csv(merged_pca_data, file = "Merged_PCA_MKL1.csv")
        write_xlsx(merged_pca_data, path = "Merged_PCA_MKL1.xlsx")
    
        # Define condition labels for clarity
        condition_labels <- c(
        "MKL1.EV" = "wt EV RNA",
        "MKL1.scr.DMSO" = "scr DMSO EV RNA",
        "MKL1.scr.Dox" = "scr Dox EV RNA",
        "MKL1.sT.DMSO" = "sT DMSO EV RNA",
        "MKL1.sT.Dox" = "sT Dox EV RNA",
        "MKL1.RNA" = "parental cell RNA"
        )
    
        # Apply condition labels and filter for relevant conditions
        filtered_pca_data <- merged_pca_data %>%
        mutate(condition = recode(condition, !!!condition_labels)) %>%
        filter(condition %in% c("wt EV RNA", "parental cell RNA"))
    
        # Prepare PCA data for plotting
        plot_pca_df <- as.data.frame(filtered_pca_data[, c("PC1", "PC2")])
        plot_pca_df$condition <- filtered_pca_data$condition
        plot_pca_df$sample_name <- rownames(filtered_pca_data)
    
        # Save PCA plot as PNG
        png("PCA_MKL1.png", width = 1000, height = 600, res = 150)
    
        ggplot(plot_pca_df, aes(x = PC1, y = PC2, color = condition)) +
        geom_point(size = 4, alpha = 0.8) +  # Plot points
        labs(
        title = "",
        x = paste0("Principal Component 1 (", pc1_variance, "%)"),
        y = paste0("Principal Component 2 (", pc2_variance, "%)")
        ) +
        theme_minimal() +
        scale_color_manual(values = c("blue", "red"))  # Customize colors
    
        dev.off()  # Close the PNG device and save the file
  5. Convert bam to bigwig using deepTools by feeding inverse of DESeq’s size Factor

        sizeFactors(dds)
        #NULL
        dds <- estimateSizeFactors(dds)
        > sizeFactors(dds)
        #WaGa parental cell RNA*
        #                  1.1242096                   2.1097851
        #WaGa parental cell RNA 118*  WaGa parental cell RNA 147*
        #                  1.6925780                   1.6712182
    
        raw_counts <- counts(dds)
        normalized_counts <- counts(dds, normalized=TRUE)
        write.table(raw_counts, file="raw_counts.txt", sep="\t", quote=F, col.names=NA)
        write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
    
        #1/2,1097851=0,473981924 (!=0.4958732)
        bamCoverage --bam ../markDuplicates/WaGa_RNAAligned.sortedByCoord.out.markDups.bam -o WaGa_RNA.bw --binSize 10 --scaleFactor 0.473981924 --effectiveGenomeSize 2864785220
        #1/1,6925780=0,590814722 (!=0.6013898)
        bamCoverage --bam ../markDuplicates/WaGa_RNA_118Aligned.sortedByCoord.out.markDups.bam -o WaGa_RNA_118.bw --binSize 10 --scaleFactor 0.590814722 --effectiveGenomeSize 2864785220
        #1/1,6712182=0,598365911 (!=0.6154516)
        bamCoverage --bam ../markDuplicates/WaGa_RNA_147Aligned.sortedByCoord.out.markDups.bam -o WaGa_RNA_147.bw --binSize 10 --scaleFactor 0.598365911 --effectiveGenomeSize 2864785220
  6. DEG analysis

        ##https://bioconductor.statistik.tu-dortmund.de/packages/3.7/data/annotation/
        #BiocManager::install("EnsDb.Mmusculus.v79")
        #library(EnsDb.Mmusculus.v79)
        #edb <- EnsDb.Mmusculus.v79
        #https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#selecting-an-ensembl-biomart-database-and-dataset
        #https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#selecting-an-ensembl-biomart-database-and-dataset
        library(biomaRt)
        #listMarts() == listEnsembl() == Ensembl Genes 113
        listEnsembl()
        listMarts()
        listEnsemblArchives()
    
        #mkdir degenes
        setwd("degenes")
    
        dds$condition <- relevel(dds$condition, "WaGa.RNA")
        dds = DESeq(dds, betaPrior=FALSE)
        resultsNames(dds)
        clist <- c("WaGa.EV_vs_WaGa.RNA")
    
        dds$condition <- relevel(dds$condition, "MKL1.RNA")
        dds = DESeq(dds, betaPrior=FALSE)
        resultsNames(dds)
        clist <- c("MKL1.EV_vs_MKL1.RNA")
    
        #ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version="112")
        #ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl", host="https://may2023.archive.ensembl.org")
        ensembl <- useMart("ensembl", dataset="hsapiens_gene_ensembl", host="https://www.ensembl.org")
    
        listDatasets(ensembl)
        attributes = listAttributes(ensembl)
        attributes[1:25,]
    
        for (i in clist) {
              #i<-clist[1]
              contrast = paste("condition", i, sep="_")
              res = results(dds, name=contrast)
              res <- res[!is.na(res$log2FoldChange),]
              geness <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'), filters = 'ensembl_gene_id', values = rownames(res), mart = ensembl)
              geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)
    
              #merge by column by common colunmn name, in the case "GENEID"
              res$ENSEMBL = rownames(res)
              identical(rownames(res), rownames(geness_uniq))
              res_df <- as.data.frame(res)
              geness_res <- merge(geness_uniq, res_df, by.x="ensembl_gene_id", by.y="ENSEMBL")
              dim(geness_res)
              rownames(geness_res) <- geness_res$ensembl_gene_id
              geness_res$ensembl_gene_id <- NULL
              write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
              up <- subset(geness_res, padj<=0.05 & log2FoldChange>=2)
              down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-2)
              write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
              write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
        }
  7. Volcano plot untreated/wt vs parental; 1x für WaGa cell line und 1x für MKL-1 cells

        #A canonical visualization for interpreting differential gene expression results is the volcano plot.
        library(ggrepel)
    
        # -- WaGa.EV_vs_WaGa.RNA --
        geness_res <- read.csv(file = paste("WaGa.EV_vs_WaGa.RNA", "all.txt", sep="-"), row.names=1)
        geness_res$Color <- "NS or log2FC < 2.0"
        geness_res$Color[geness_res$pvalue < 0.05] <- "P < 0.05"
        geness_res$Color[geness_res$padj < 0.05] <- "P-adj < 0.05"
        geness_res$Color[geness_res$padj < 0.001] <- "P-adj < 0.001"
        geness_res$Color[abs(geness_res$log2FoldChange) < 2.0] <- "NS or log2FC < 2.0"
        geness_res$Color <- factor(geness_res$Color,
                                levels = c("NS or log2FC < 2.0", "P < 0.05",
                                      "P-adj < 0.05", "P-adj < 0.001"))
    
        geness_res$invert_P <- (-log10(geness_res$pvalue)) * sign(geness_res$log2FoldChange)
        top_g <- c()
        top_g <- c(top_g, geness_res[, 'external_gene_name'][order(geness_res[, 'invert_P'], decreasing = TRUE)[1:200]], geness_res[, 'external_gene_name'][order(geness_res[, 'invert_P'], decreasing = FALSE)[1:200]])
        top_g <- unique(top_g)
        geness_res <- geness_res[, -1*ncol(geness_res)]  #remove invert_P from matrix
    
        png("WaGa_wt.EV_vs_parental.png",width=1400, height=1000)
        ggplot(geness_res,
              aes(x = log2FoldChange, y = -log10(pvalue),
              color = Color, label = external_gene_name)) +
        geom_vline(xintercept = c(2.0, -2.0), lty = "dashed") +
        geom_hline(yintercept = -log10(0.05), lty = "dashed") +
        geom_point() +
        labs(x = "log2(FC)",
              y = "Significance, -log10(P)",
              color = "Significance") +
        scale_color_manual(values = c(`P-adj < 0.001` = "dodgerblue",
                                      `P-adj < 0.05` = "lightblue",
                                      `P < 0.05` = "orange2",
                                      `NS or log2FC < 2.0` = "gray"),
                          guide = guide_legend(override.aes = list(size = 4))) +
        scale_y_continuous(expand = expansion(mult = c(0,0.05))) +
        geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0)),
                          size = 4, point.padding = 0.15, color = "black",
                          min.segment.length = .1, box.padding = .2, lwd = 2) +
        theme_bw(base_size = 16) +
        theme(legend.position = "bottom")
        dev.off()
    
        # -- MKL1.EV_vs_MKL1.RNA --
        geness_res <- read.csv(file = paste("MKL1.EV_vs_MKL1.RNA", "all.txt", sep="-"), row.names=1)
        geness_res$Color <- "NS or log2FC < 2.0"
        geness_res$Color[geness_res$pvalue < 0.05] <- "P < 0.05"
        geness_res$Color[geness_res$padj < 0.05] <- "P-adj < 0.05"
        geness_res$Color[geness_res$padj < 0.001] <- "P-adj < 0.001"
        geness_res$Color[abs(geness_res$log2FoldChange) < 2.0] <- "NS or log2FC < 2.0"
        geness_res$Color <- factor(geness_res$Color,
                                levels = c("NS or log2FC < 2.0", "P < 0.05",
                                      "P-adj < 0.05", "P-adj < 0.001"))
    
        geness_res$invert_P <- (-log10(geness_res$pvalue)) * sign(geness_res$log2FoldChange)
        top_g <- c()
        top_g <- c(top_g, geness_res[, 'external_gene_name'][order(geness_res[, 'invert_P'], decreasing = TRUE)[1:200]], geness_res[, 'external_gene_name'][order(geness_res[, 'invert_P'], decreasing = FALSE)[1:200]])
        top_g <- unique(top_g)
        geness_res <- geness_res[, -1*ncol(geness_res)]  #remove invert_P from matrix
    
        png("MKL-1_wt.EV_vs_parental.png",width=1400, height=1000)
        ggplot(geness_res,
              aes(x = log2FoldChange, y = -log10(pvalue),
              color = Color, label = external_gene_name)) +
        geom_vline(xintercept = c(2.0, -2.0), lty = "dashed") +
        geom_hline(yintercept = -log10(0.05), lty = "dashed") +
        geom_point() +
        labs(x = "log2(FC)",
              y = "Significance, -log10(P)",
              color = "Significance") +
        scale_color_manual(values = c(`P-adj < 0.001` = "dodgerblue",
                                      `P-adj < 0.05` = "lightblue",
                                      `P < 0.05` = "orange2",
                                      `NS or log2FC < 2.0` = "gray"),
                          guide = guide_legend(override.aes = list(size = 4))) +
        scale_y_continuous(expand = expansion(mult = c(0,0.05))) +
        geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res$log2FoldChange) >= 2.0)),
                          size = 4, point.padding = 0.15, color = "black",
                          min.segment.length = .1, box.padding = .2, lwd = 2) +
        theme_bw(base_size = 16) +
        theme(legend.position = "bottom")
        dev.off()
    
        # -- under console --
        mv WaGa.EV_vs_WaGa.RNA-all.txt WaGa_wt.EV_vs_parental-all.txt
        mv WaGa.EV_vs_WaGa.RNA-up.txt WaGa_wt.EV_vs_parental-up.txt
        mv WaGa.EV_vs_WaGa.RNA-down.txt WaGa_wt.EV_vs_parental-down.txt
        mv MKL1.EV_vs_MKL1.RNA-all.txt MKL-1_wt.EV_vs_parental-all.txt
        mv MKL1.EV_vs_MKL1.RNA-up.txt MKL-1_wt.EV_vs_parental-up.txt
        mv MKL1.EV_vs_MKL1.RNA-down.txt MKL-1_wt.EV_vs_parental-down.txt
        for cmp in "WaGa_wt.EV_vs_parental" "MKL-1_wt.EV_vs_parental"; do
              echo "~/Tools/csv2xls-0.4/csv_to_xls.py ${cmp}-all.txt ${cmp}-up.txt ${cmp}-down.txt -d$',' -o ${cmp}.xls"
        done
        ~/Tools/csv2xls-0.4/csv_to_xls.py WaGa_wt.EV_vs_parental-all.txt WaGa_wt.EV_vs_parental-up.txt WaGa_wt.EV_vs_parental-down.txt -d$',' -o WaGa_wt.EV_vs_parental.xls
        ~/Tools/csv2xls-0.4/csv_to_xls.py MKL-1_wt.EV_vs_parental-all.txt MKL-1_wt.EV_vs_parental-up.txt MKL-1_wt.EV_vs_parental-down.txt -d$',' -o MKL-1_wt.EV_vs_parental.xls
  8. Heatmap untreated/wt vs parental; 1x für WaGa cell line und 1x für MKL-1 cells

        install.packages("gplots")
        library("gplots")
    
        # -- WaGa cell line --
        cut -d',' -f1-1 ./WaGa_wt.EV_vs_parental-up.txt > WaGa_wt.EV_vs_parental-up.id
        cut -d',' -f1-1 ./WaGa_wt.EV_vs_parental-down.txt > WaGa_wt.EV_vs_parental-down.id
        cat WaGa_wt.EV_vs_parental-up.id WaGa_wt.EV_vs_parental-down.id | sort -u > ids  #24207
    
        #add Gene_Id in the first line.
        GOI <- read.csv("ids")$Gene_Id
        RNASeq.NoCellLine <- assay(rld)
    
        #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
        datamat = RNASeq.NoCellLine[GOI, c("WaGa parental cell RNA","WaGa parental cell RNA 118","WaGa parental cell RNA 147", "WaGa wt EV RNA","WaGa wt EV RNA 2","WaGa wt EV RNA 118","WaGa wt EV RNA 147","WaGa wt EV RNA 226",  "WaGa wt EV RNA 1107","WaGa wt EV RNA 1605","WaGa wt EV RNA 2706")]
        hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="average")
        hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="average")
        mycl = cutree(hr, h=max(hr$height)/1.05)
        mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
    
        mycol = mycol[as.vector(mycl)]
        sampleCols <- rep('GREY',ncol(datamat))
        names(sampleCols) <- c("WaGa parental cell RNA","WaGa parental cell RNA 118","WaGa parental cell RNA 147", "WaGa wt EV RNA","WaGa wt EV RNA 2","WaGa wt EV RNA 118","WaGa wt EV RNA 147","WaGa wt EV RNA 226",  "WaGa wt EV RNA 1107","WaGa wt EV RNA 1605","WaGa wt EV RNA 2706")
    
        #DARKBLUE DARKGREEN DARKRED DARKORANGE
        sampleCols["WaGa parental cell RNA"] <- 'GREY'
        sampleCols["WaGa parental cell RNA 118"] <- 'GREY'
        sampleCols["WaGa parental cell RNA 147"] <- 'GREY'
        sampleCols["WaGa wt EV RNA"] <- 'GREEN'
        sampleCols["WaGa wt EV RNA 2"] <- 'GREEN'
        sampleCols["WaGa wt EV RNA 118"] <- 'GREEN'
        sampleCols["WaGa wt EV RNA 147"] <- 'GREEN'
        sampleCols["WaGa wt EV RNA 226"] <- 'GREEN'
        sampleCols["WaGa wt EV RNA 1107"] <- 'GREEN'
        sampleCols["WaGa wt EV RNA 1605"] <- 'GREEN'
        sampleCols["WaGa wt EV RNA 2706"] <- 'GREEN'
    
        png("DEGs_Heatmap_WaGa.png", width=1000, height=1200)
        heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
                    scale='row',trace='none',col=bluered(75),
                    RowSideColors = mycol, ColSideColors = sampleCols, labRow="", margins=c(22,10), cexRow=8, cexCol=2, srtCol=30, lwid=c(1,7), lhei = c(1, 8), legend("top", title = "",legend=c("WaGa parental cell RNA","WaGa wt EV RNA"), fill=c("GREY","GREEN"), cex=1.0, box.lty=0))
        dev.off()
    
        # save cluster members
        subset_1<-names(subset(mycl, mycl == '1'))
        subset_1_ <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
              filters = 'ensembl_gene_id',
              values = subset_1,
              mart = ensembl)
        subset_1_uniq <- distinct(subset_1_, ensembl_gene_id, .keep_all= TRUE)
        subset_1_expr  <- datamat[subset_1,]
        subset_1_expr <- as.data.frame(subset_1_expr)
        subset_1_expr$ENSEMBL = rownames(subset_1_expr)
        cluster1_YELLOW <- merge(subset_1_uniq, subset_1_expr, by.x="ensembl_gene_id", by.y="ENSEMBL")
        #write.csv(cluster1_YELLOW,file='cluster1_YELLOW.txt')
    
        subset_2<-names(subset(mycl, mycl == '2'))
        subset_2_ <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
              filters = 'ensembl_gene_id',
              values = subset_2,
              mart = ensembl)
        subset_2_uniq <- distinct(subset_2_, ensembl_gene_id, .keep_all= TRUE)
        subset_2_expr  <- datamat[subset_2,]
        subset_2_expr <- as.data.frame(subset_2_expr)
        subset_2_expr$ENSEMBL = rownames(subset_2_expr)
        cluster2_DARKBLUE <- merge(subset_2_uniq, subset_2_expr, by.x="ensembl_gene_id", by.y="ENSEMBL")
        #write.csv(cluster2_DARKBLUE,file='cluster2_DARKBLUE.txt')
    
        write_xlsx(list(
        "Cluster 1 YELLOW" = cluster1_YELLOW,
        "Cluster 2 DARKBLUE" = cluster2_DARKBLUE
        ), "DEGs_heatmap_data_WaGa.xlsx")
    
        # -- MKL-1 cell line --
        cut -d',' -f1-1 ./MKL-1_wt.EV_vs_parental-up.txt > MKL-1_wt.EV_vs_parental-up.id
        cut -d',' -f1-1 ./MKL-1_wt.EV_vs_parental-down.txt > MKL-1_wt.EV_vs_parental-down.id
        cat MKL-1_wt.EV_vs_parental-up.id MKL-1_wt.EV_vs_parental-down.id | sort -u > ids  #20720
    
        #add Gene_Id in the first line.
        GOI <- read.csv("ids")$Gene_Id
        RNASeq.NoCellLine <- assay(rld)
    
        #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
        datamat = RNASeq.NoCellLine[GOI, c("MKL-1 parental cell RNA","MKL-1 parental cell RNA 118","MKL-1 parental cell RNA 147", "MKL-1 wt EV RNA","MKL-1 wt EV RNA 2","MKL-1 wt EV RNA 27","MKL-1 wt EV RNA 87","MKL-1 wt EV RNA 118",  "MKL-1 wt EV RNA 042")]
        #Check for Zero Variance Rows; If any row has variance 0, remove it.
        #datamat <- datamat[apply(datamat, 1, var) != 0, ]
        hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="average")
        hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="average")
        mycl = cutree(hr, h=max(hr$height)/1.05)
        mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
    
        mycol = mycol[as.vector(mycl)]
        sampleCols <- rep('GREY',ncol(datamat))
        names(sampleCols) <- c("MKL-1 parental cell RNA","MKL-1 parental cell RNA 118","MKL-1 parental cell RNA 147", "MKL-1 wt EV RNA","MKL-1 wt EV RNA 2","MKL-1 wt EV RNA 27","MKL-1 wt EV RNA 87","MKL-1 wt EV RNA 118",  "MKL-1 wt EV RNA 042")
    
        sampleCols["MKL-1 parental cell RNA"] <- 'GREY'
        sampleCols["MKL-1 parental cell RNA 118"] <- 'GREY'
        sampleCols["MKL-1 parental cell RNA 147"] <- 'GREY'
        sampleCols["MKL-1 wt EV RNA"] <- 'GREEN'
        sampleCols["MKL-1 wt EV RNA 2"] <- 'GREEN'
        sampleCols["MKL-1 wt EV RNA 27"] <- 'GREEN'
        sampleCols["MKL-1 wt EV RNA 87"] <- 'GREEN'
        sampleCols["MKL-1 wt EV RNA 118"] <- 'GREEN'
        sampleCols["MKL-1 wt EV RNA 042"] <- 'GREEN'
    
        png("DEGs_Heatmap_MKL-1.png", width=1000, height=1200)
        heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
                    scale='row',trace='none',col=bluered(75),
                    RowSideColors = mycol, ColSideColors = sampleCols, labRow="", margins=c(22,10), cexRow=8, cexCol=2, srtCol=30, lwid=c(1,7), lhei = c(1, 8), legend("top", title = "",legend=c("MKL-1 parental cell RNA","MKL-1 wt EV RNA"), fill=c("GREY","GREEN"), cex=1.0, box.lty=0))
        dev.off()
    
        # save cluster members
        subset_1<-names(subset(mycl, mycl == '1'))
        subset_1_ <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
              filters = 'ensembl_gene_id',
              values = subset_1,
              mart = ensembl)
        subset_1_uniq <- distinct(subset_1_, ensembl_gene_id, .keep_all= TRUE)
        subset_1_expr  <- datamat[subset_1,]
        subset_1_expr <- as.data.frame(subset_1_expr)
        subset_1_expr$ENSEMBL = rownames(subset_1_expr)
        cluster1_YELLOW <- merge(subset_1_uniq, subset_1_expr, by.x="ensembl_gene_id", by.y="ENSEMBL")
        #write.csv(cluster1_YELLOW,file='cluster1_YELLOW.txt')
    
        subset_2<-names(subset(mycl, mycl == '2'))
        subset_2_ <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
              filters = 'ensembl_gene_id',
              values = subset_2,
              mart = ensembl)
        subset_2_uniq <- distinct(subset_2_, ensembl_gene_id, .keep_all= TRUE)
        subset_2_expr  <- datamat[subset_2,]
        subset_2_expr <- as.data.frame(subset_2_expr)
        subset_2_expr$ENSEMBL = rownames(subset_2_expr)
        cluster2_DARKBLUE <- merge(subset_2_uniq, subset_2_expr, by.x="ensembl_gene_id", by.y="ENSEMBL")
        #write.csv(cluster2_DARKBLUE,file='cluster2_DARKBLUE.txt')
    
        write_xlsx(list(
        "Cluster 1 YELLOW" = cluster1_YELLOW,
        "Cluster 2 DARKBLUE" = cluster2_DARKBLUE
        ), "DEGs_heatmap_data_MKL-1.xlsx")
  9. Distribution of different RNA Species untreated/wt and parental; 1x für WaGa cell line und 1x für MKL-1 cells

        cd ~/DATA/Data_Ute/Data_RNA-Seq_MKL-1+WaGa/results_2025_1/read_distributions
        cp ../../../Data_RNA-Seq_WaGa/results/rseqc/read_distribution/WaGa_RNAAligned.sortedByCoord.out.read_distribution.txt WaGa_parental_cell_RNA_read_distribution.txt
        cp ../../../Data_RNA-Seq_WaGa/results/rseqc/read_distribution/WaGa_RNA_118Aligned.sortedByCoord.out.read_distribution.txt WaGa_parental_cell_RNA_118_read_distribution.txt
        cp ../../../Data_RNA-Seq_WaGa/results/rseqc/read_distribution/WaGa_RNA_147Aligned.sortedByCoord.out.read_distribution.txt WaGa_parental_cell_RNA_147_read_distribution.txt
        cp ../../../Data_RNA-Seq_WaGa/results/rseqc/read_distribution/WaGa_EV-RNAAligned.sortedByCoord.out.read_distribution.txt  WaGa_wt_EV_RNA_read_distribution.txt
        cp ../../../Data_RNA-Seq_WaGa/results/rseqc/read_distribution/WaGa_EV-RNA_2Aligned.sortedByCoord.out.read_distribution.txt WaGa_wt_EV_RNA_2_read_distribution.txt
        cp ../../../Data_RNA-Seq_WaGa/results/rseqc/read_distribution/WaGa_EV-RNA_118Aligned.sortedByCoord.out.read_distribution.txt WaGa_wt_EV_RNA_118_read_distribution.txt
        cp ../../../Data_RNA-Seq_WaGa/results/rseqc/read_distribution/WaGa_EV-RNA_147Aligned.sortedByCoord.out.read_distribution.txt WaGa_wt_EV_RNA_147_read_distribution.txt
        cp ../../../Data_RNA-Seq_WaGa/results/rseqc/read_distribution/WaGa_EV-RNA_226Aligned.sortedByCoord.out.read_distribution.txt WaGa_wt_EV_RNA_226_read_distribution.txt
        cp ../../../Data_RNA-Seq_WaGa/results/rseqc/read_distribution/1107_WaGa_wt_EVAligned.sortedByCoord.out.read_distribution.txt WaGa_wt_EV_RNA_1107_read_distribution.txt
        cp ../../../Data_RNA-Seq_WaGa/results/rseqc/read_distribution/1605_WaGa_wt_EVAligned.sortedByCoord.out.read_distribution.txt WaGa_wt_EV_RNA_1605_read_distribution.txt
        cp ../../../Data_RNA-Seq_WaGa/results/rseqc/read_distribution/2706_WaGa_wt_EVAligned.sortedByCoord.out.read_distribution.txt WaGa_wt_EV_RNA_2706_read_distribution.txt
    
        cp ../../../Data_RNA-Seq_MKL-1/results/rseqc/read_distribution/MKL-1_RNAAligned.sortedByCoord.out.read_distribution.txt MKL-1_parental_cell_RNA_read_distribution.txt
        cp ../../../Data_RNA-Seq_MKL-1/results/rseqc/read_distribution/MKL-1_RNA_118Aligned.sortedByCoord.out.read_distribution.txt MKL-1_parental_cell_RNA_118_read_distribution.txt
        cp ../../../Data_RNA-Seq_MKL-1/results/rseqc/read_distribution/MKL-1_RNA_147Aligned.sortedByCoord.out.read_distribution.txt MKL-1_parental_cell_RNA_147_read_distribution.txt
        cp ../../../Data_RNA-Seq_MKL-1/results/rseqc/read_distribution/MKL-1_EV-RNAAligned.sortedByCoord.out.read_distribution.txt MKL-1_wt_EV_RNA_read_distribution.txt
        cp ../../../Data_RNA-Seq_MKL-1/results/rseqc/read_distribution/MKL-1_EV-RNA_2Aligned.sortedByCoord.out.read_distribution.txt MKL-1_wt_EV_RNA_2_read_distribution.txt
        cp ../../../Data_RNA-Seq_MKL-1/results/rseqc/read_distribution/MKL-1_EV-RNA_27Aligned.sortedByCoord.out.read_distribution.txt MKL-1_wt_EV_RNA_27_read_distribution.txt
        cp ../../../Data_RNA-Seq_MKL-1/results/rseqc/read_distribution/MKL-1_EV-RNA_87Aligned.sortedByCoord.out.read_distribution.txt MKL-1_wt_EV_RNA_87_read_distribution.txt
  10. RNA fragmentation patterns: is EV-RNA full length or fragmented?

        Since our data consists of single-end RNA-Seq reads, determining the exact RNA fragment length is challenging, as only the read length is available. Unlike paired-end sequencing, where insert size can be calculated, single-end sequencing does not provide direct information about the full fragment size.

Presence-Absence Table and Graphics for Selected Genes in Data_Patricia_Sepi_7samples

ggtree_and_gheatmap_phages.png

ggtree_and_gheatmap_selected_genes.png

  1. run with bengal3

    cd ~/DATA/Data_Denise_CalCov1
    cp bacto-0.1.json ../Data_Denise_CalCov2
    cp cluster.json ../Data_Denise_CalCov2
    cp Snakefile ../Data_Denise_CalCov2
    ln -s /home/jhuang/Tools/bacto/local .
    ln -s /home/jhuang/Tools/bacto/db .
    ln -s /home/jhuang/Tools/bacto/envs .
    
    mkdir raw_data; cd raw_data
    ln -s ../Alignment_Imported_1/20240913_174420/Fastq/HDM7_S1_L001_R1_001.fastq.gz HDM7_R1.fastq.gz
    ln -s ../Alignment_Imported_1/20240913_174420/Fastq/HDM7_S1_L001_R2_001.fastq.gz HDM7_R2.fastq.gz
    ln -s ../Alignment_Imported_1/20240913_174420/Fastq/HDM10_S2_L001_R1_001.fastq.gz HDM10_R1.fastq.gz
    ln -s ../Alignment_Imported_1/20240913_174420/Fastq/HDM10_S2_L001_R2_001.fastq.gz HDM10_R2.fastq.gz
    
    ln -s ../20240812_FS10003086_50_BSB09416-2831/Alignment_Imported_1/20240813_202730/Fastq/HDM1_S1_L001_R1_001.fastq.gz HDM1_R1.fastq.gz
    ln -s ../20240812_FS10003086_50_BSB09416-2831/Alignment_Imported_1/20240813_202730/Fastq/HDM1_S1_L001_R2_001.fastq.gz HDM1_R2.fastq.gz
    ln -s ../20240913/Alignment_Imported_1/20240913_174420/Fastq/HDM7_S1_L001_R1_001.fastq.gz HDM7_R1.fastq.gz
    ln -s ../20240913/Alignment_Imported_1/20240913_174420/Fastq/HDM7_S1_L001_R2_001.fastq.gz HDM7_R2.fastq.gz
    ln -s ../20240913/Alignment_Imported_1/20240913_174420/Fastq/HDM10_S2_L001_R1_001.fastq.gz HDM10_R1.fastq.gz
    ln -s ../20240913/Alignment_Imported_1/20240913_174420/Fastq/HDM10_S2_L001_R2_001.fastq.gz HDM10_R2.fastq.gz
    ln -s ../20240919_FS10003086_61_BSB09416-2735/Alignment_Imported_1/20240920_173408/Fastq/HDM11-SF1_S1_L001_R1_001.fastq.gz HDM11-SF1_R1.fastq.gz
    ln -s ../20240919_FS10003086_61_BSB09416-2735/Alignment_Imported_1/20240920_173408/Fastq/HDM11-SF1_S1_L001_R2_001.fastq.gz HDM11-SF1_R2.fastq.gz
    ln -s ../20240919_FS10003086_61_BSB09416-2735/Alignment_Imported_1/20240920_173408/Fastq/HDM15-SF2_S2_L001_R1_001.fastq.gz HDM15-SF2_R1.fastq.gz
    ln -s ../20240919_FS10003086_61_BSB09416-2735/Alignment_Imported_1/20240920_173408/Fastq/HDM15-SF2_S2_L001_R2_001.fastq.gz HDM15-SF2_R2.fastq.gz
    
    ln -s ../20250203_FS10003086_95_BTR67811-0621/Alignment_Imported_1/20250204_154511/Fastq/HDM2-SF1_S1_L001_R1_001.fastq.gz HDM2-SF1_R1.fastq.gz
    ln -s ../20250203_FS10003086_95_BTR67811-0621/Alignment_Imported_1/20250204_154511/Fastq/HDM2-SF1_S1_L001_R2_001.fastq.gz HDM2-SF1_R2.fastq.gz
    ln -s ../20250203_FS10003086_95_BTR67811-0621/Alignment_Imported_1/20250204_154511/Fastq/HDM20-SF3_S2_L001_R1_001.fastq.gz HDM20-SF3_R1.fastq.gz
    ln -s ../20250203_FS10003086_95_BTR67811-0621/Alignment_Imported_1/20250204_154511/Fastq/HDM20-SF3_S2_L001_R2_001.fastq.gz HDM20-SF3_R2.fastq.gz
    
    # only activate the steps assembly and mlst in bacto-0.1.json.
    mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
    cd ~/DATA/Data_Patricia_Sepi_7samples
    (bengal3_ac3) jhuang@WS-2290C:~/DATA/Data_Patricia_Sepi_7samples$ /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
  2. MLST-results

    cd mlst
    cat *.mlst.txt | sort > mlst_sorted
    
    shovill/HDM1/contigs.fa sepidermidis    5   1   1   1   2   2   1   1
    shovill/HDM7/contigs.fa sepidermidis    59  2   1   1   1   2   1   1
    shovill/HDM10/contigs.fa    sepidermidis    59  2   1   1   1   2   1   1
    shovill/HDM11-SF1/contigs.fa    sepidermidis    224 19  16  19  6   3   19  10
    shovill/HDM15-SF2/contigs.fa    sepidermidis    87  7   1   1   2   2   1   1
    shovill/HDM2-SF1/contigs.fa sepidermidis    -   1   2   ~2  2   1   1   10
    shovill/HDM20-SF3/contigs.fa    sepidermidis    -   7   1   ~2  2   4   1   1
  3. (optional) mapping on assembly to calculate the coverage

    #samtools depth input.bam > depth.txt
    #samtools depth input.bam | awk '{sum+=$3} END { print "Average coverage:",sum/NR}'
    #bedtools coverage -a regions.bed -b input.bam > coverage.txt
    #bedtools coverage -a regions.bed -b input.bam -d > coverage_per_base.txt
    
    bwa index ./shovill/HDM1/contigs.fa
    bwa mem ./shovill/HDM1/contigs.fa fastq/HDM1_1.fastq fastq/HDM1_2.fastq > aligned.sam
    samtools view -Sb aligned.sam > aligned.bam
    samtools sort aligned.bam -o sorted.bam
    samtools index sorted.bam
    samtools depth sorted.bam > depth.txt
    awk '{sum+=$3} END { print "Average coverage:",sum/NR}' depth.txt
    bedtools coverage -a regions.bed -b sorted.bam > coverage.txt
    bedtools genomecov -ibam sorted.bam -d > coverage_per_base.txt
    
    # Step 1: Calculate depth using samtools
    samtools depth sorted.bam > depth.txt
    # Step 2: Calculate average depth using awk
    awk '{sum+=$3; count++} END {print "Average Coverage:", sum/count}' depth.txt
    
    # Step 1: Calculate coverage with bedtools for a BED file
    #bedtools coverage -a regions.bed -b input.bam > coverage.txt
    # Step 2: Process the output with awk
    #awk '{ sum+=$7 } END { print "Average coverage depth:", sum/NR }' coverage.txt
    
    bwa index ./shovill/HDM7/contigs.fa
    bwa mem ./shovill/HDM7/contigs.fa fastq/HDM7_1.fastq fastq/HDM7_2.fastq > aligned_HDM7.sam
    samtools view -Sb aligned_HDM7.sam > aligned_HDM7.bam
    samtools sort aligned_HDM7.bam -o sorted_HDM7.bam
    samtools index sorted_HDM7.bam
    # Step 1: Calculate depth using samtools
    samtools depth sorted_HDM7.bam > depth_HDM7.txt
    # Step 2: Calculate average depth using awk
    awk '{sum+=$3; count++} END {print "Average Coverage:", sum/count}' depth_HDM7.txt
    #Average Coverage: 380.079
    
    bwa index ./shovill/HDM10/contigs.fa
    bwa mem ./shovill/HDM10/contigs.fa fastq/HDM10_1.fastq fastq/HDM10_2.fastq > aligned_HDM10.sam
    samtools view -Sb aligned_HDM10.sam > aligned_HDM10.bam
    samtools sort aligned_HDM10.bam -o sorted_HDM10.bam
    samtools index sorted_HDM10.bam
    # Step 1: Calculate depth using samtools
    samtools depth sorted_HDM10.bam > depth_HDM10.txt
    # Step 2: Calculate average depth using awk
    awk '{sum+=$3; count++} END {print "Average Coverage:", sum/count}' depth_HDM10.txt
    #Average Coverage: 254.704
  4. SCCmec typing using Online-Service https://cge.food.dtu.dk/services/SCCmecFinder/ and drawing with clinker

    #1. -- HDM1_contigs.fa --
    
    One SCCmec element detected.
    
    Prediction based on genes:
    Predicted SCCmec element: SCCmec_type_IV(2B)
    
    Prediction based on homology to whole cassette:
    Predicted whole cassette and %template coverage: SCCmec_type_IV(2B) 79.92%
    
    Predicted genes:
    Fasta header % Identity Query/HSP Length Contig Position in contig
    
    ccrA2:7:81108:AB096217  100.00  1350/1350   contig00032 3770..5119
    ccrB2:9:JCSC4469:AB097677   99.94   1650/1650   contig00032 5120..6769
    IS1272:3:AM292304   99.95   1844/1843   contig00032 8611..10454
    dmecR1:1:AB033763   100.00  987/987 contig00032 10443..11429
    mecA:12:AB505628    100.00  2010/2010   contig00032 11526..13535
    
    samtools faidx shovill/HDM1_contigs.fa
    samtools faidx shovill/HDM1_contigs.fa contig00032:1-13635 > HDM1_sub.fna
    bakta --db /mnt/nvme0n1p1/bakta_db HDM1_sub.fna
    
    #2. -- HDM7_contigs.fa --
    
    One SCCmec element detected.
    
    Prediction based on genes:
    Predicted SCCmec element: SCCmec_type_IVa(2B)
    
    Prediction based on homology to whole cassette:
    Predicted whole cassette and %template coverage: SCCmec_type_IVa(2B) 84.24%
    
    Predicted genes:
    Fasta header % Identity Query/HSP Length Contig Position in contig
    
    mecA:12:AB505628    100.00  2010/2010   contig00014 2800..4809
    dmecR1:1:AB033763   99.90   987/987 contig00014 4906..5892
    IS1272:3:AM292304   100.00  1843/1843   contig00014 5881..7723
    ccrB2:3:CA05:AB063172   100.00  1629/1629   contig00014 9565..11193
    ccrA2:3:CA05:AB063172   100.00  1350/1350   contig00014 11215..12564
    subtype-IVa(2B):1:CA05:AB063172 100.00  1491/1491   contig00014 16461..17951
    #IS1272:2:AB033763  91.06   1577/1585   contig00001 369260..370836
    
    samtools faidx shovill/HDM7_contigs.fa
    samtools faidx shovill/HDM7_contigs.fa contig00014:2700-18051 > HDM7_sub.fna
    bakta --db /mnt/nvme0n1p1/bakta_db HDM7_sub.fna
    
    mecA
    dmecR1
    Type I restriction enzyme HindI endonuclease subunit-like C-terminal domain-containing protein
    IS1272
    DUF1643 domain-containing protein
    Pyridoxal phosphate-dependent enzyme
    hypothetical protein
    ccrB2
    ccrA2
    DUF927 domain-containing protein
    hypothetical protein
    ACP synthase
    AAA family ATPase (= subtype-IVa(2B))
    
    #3. -- HDM10_contigs.fa --
    
    Prediction based on genes:
    Predicted SCCmec element: SCCmec_type_IV(2B&5)
    
    Prediction based on homology to whole cassette:
    Predicted whole cassette and % template coverage: SCCmec_type_IV(2B) 84.37%
    
    Predicted genes:
    
    Fasta header % Identity Query/HSP Length Contig Position in contig
    
    subtype-IVa(2B):1:CA05:AB063172 100.00  1491/1491   contig00020 4152..5642
    ccrA2:3:CA05:AB063172   100.00  1350/1350   contig00020 9539..10888
    ccrB2:3:CA05:AB063172   100.00  1629/1629   contig00020 10910..12538
    IS1272:3:AM292304   100.00  1843/1843   contig00020 14380..16222
    dmecR1:1:AB033763   100.00  987/987 contig00020 16211..17197
    mecA:12:AB505628    100.00  2010/2010   contig00020 17294..19303
    #IS1272:2:AB033763  90.75   1579/1585   contig00033 2..1580
    #ccrC1-allele-2:1:AB512767  90.95   1680/1680   contig00022 9836..11515
    
    samtools faidx shovill/HDM10_contigs.fa
    samtools faidx shovill/HDM10_contigs.fa contig00020:4052-19403 > HDM10_sub.fna
    bakta --db /mnt/nvme0n1p1/bakta_db HDM10_sub.fna
    
    #4. -- HDM11-SF1_contigs.fa --
    
    No SCCmec element was detected
    
    Prediction based on genes:
    Predicted SCCmec element: none
    
    Prediction based on homology to whole cassette:
    Predicted whole cassette and %template coverage: none
    
    #5. -- HDM15-SF2_contigs.fa --
    
    SCCmec_type_IV(2B)
    SCCmec_type_VI(4B)
    Following gene complexes based on prediction of genes was detected :
    ccr class 2
    ccr class 4
    mec class B
    
    Predicted genes:
    Fasta header % Identity Query/HSP Length Contig Position in contig
    
    ccrA2:7:81108:AB096217  100.00  1350/1350   contig00004 3823..5172
    ccrB2:9:JCSC4469:AB097677   99.94   1650/1650   contig00004 5173..6822
    IS1272:3:AM292304   99.95   1844/1843   contig00004 8664..10507
    dmecR1:1:AB033763   100.00  987/987 contig00004 10496..11482
    mecA:12:AB505628    100.00  2010/2010   contig00004 11579..13588
    
    subtyppe-Vc(5C2&5):10:AB505629  99.84   1935/1935   contig00004 20148..22082
    
    ccrA4:2:BK20781:FJ670542    90.53   1362/1362   contig00004 24570..25931
    ccrB4:2:BK20781:FJ670542    91.68   1635/1629   contig00004 25928..27562
    
    subtype-IVa(2B):1:CA05:AB063172 100.00  1491/1491   contig00015 52228..53718
    
    samtools faidx shovill/HDM7_contigs.fa
    samtools faidx shovill/HDM7_contigs.fa contig00014:2700-18051 > HDM7_sub.fna
    bakta --db /mnt/nvme0n1p1/bakta_db HDM7_sub.fna
    
    samtools faidx shovill/HDM10_contigs.fa
    samtools faidx shovill/HDM10_contigs.fa contig00020:4052-19403 > HDM10_sub.fna
    bakta --db /mnt/nvme0n1p1/bakta_db HDM10_sub.fna
    
    samtools faidx shovill/HDM15-SF2_contigs.fa
    samtools faidx shovill/HDM15-SF2_contigs.fa contig00004:1-27662 > HDM15-SF2_sub.fna
    samtools faidx shovill/HDM15-SF2_contigs.fa contig00015:52128-53818 >> HDM15-SF2_sub.fna
    bakta --db /mnt/nvme0n1p1/bakta_db HDM15-SF2_sub.fna
    
    #6. -- HDM2-SF1_contigs.fa --
    The input organism was prediced as a MRSA isolate
    The mecA gene was detected
    
    One SCCmec element detected.
    
    Prediction based on genes:
    Predicted SCCmec element: SCCmec_type_IVa(2B)
    
    Prediction based on homology to whole cassette:
    Predicted whole cassette and %template coverage: SCCmec_type_IVa(2B) 87.68%
    
    #7. -- HDM20-SF3_contigs.fa --
    The input organism was prediced as a MSSA isolate
    The mecA/mecC gene was not detected
    No SCCmec element was detected
    Prediction based on genes:
    Predicted SCCmec element: none
    Prediction based on homology to whole cassette:
    Predicted whole cassette and %template coverage: none
    
    #-
    The input organism was prediced as a MSSA isolate
    The mecA/mecC gene was not detected
    
    No SCCmec element was detected
    
    Prediction based on genes:
    Predicted SCCmec element: none
    
    Prediction based on homology to whole cassette:
    Predicted whole cassette and %template coverage: none
    
    #END
    #172.104.140.19
    
    mkdir gbff_sub
    mv *_sub.gbff gbff_sub
    cd gbff_sub
    for f in *_sub.gbff; do mv "$f" "${f/_sub.gbff/.gbff}"; done
    #mv HDM1_sub.gbff HDM1.gbff
    #mv HDM7_sub.gbff HDM7.gbff
    #mv HDM10_sub.gbff HDM10.gbff
    #mv HDM15-SF2_sub.gbff HDM15-SF2.gbff
    
    rm *.json
    clinker *.gbff -p plot_HDRNA.html --dont_set_origin -s session_HDRNA.json -o alignments_HDRNA.csv -dl "," -dc 4
    
    cp ./gbff_HDRNA_01/clinker.png HDRNA_01_clinker.png
  5. Agr typing

    #under env (bakta)
    mamba activate /home/jhuang/miniconda3/envs/bakta
    for sample in HDM1 HDM7 HDM10 HDM11-SF1 HDM15-SF2 HDM2-SF1 HDM20-SF3; do
        bakta --db /mnt/nvme0n1p1/bakta_db shovill/${sample}/contigs.fa --prefix ${sample}
    done
    mv HDM1* bakta_results
    mv HDM2* bakta_results
    mv HDM7* bakta_results
    
    cd bakta_results
    grep "agrD" *.gbff | sort
    
    HDM1.gbff:                     /gene="agrD"
    HDM1.gbff:                     /gene="agrD"
    HDM7.gbff:                     /gene="agrD"
    HDM7.gbff:                     /gene="agrD"
    HDM10.gbff:                     /gene="agrD"
    HDM10.gbff:                     /gene="agrD"
    HDM11-SF1.gbff:                     /gene="agrD"
    HDM11-SF1.gbff:                     /gene="agrD"
    HDM15-SF2.gbff:                     /gene="agrD"
    HDM15-SF2.gbff:                     /gene="agrD"
    
    HDM2-SF1.gbff:                     /gene="agrD"
    HDM2-SF1.gbff:                     /gene="agrD"
    HDM20-SF3.gbff:                     /gene="agrD"
    HDM20-SF3.gbff:                     /gene="agrD"
    
    MNLLGGLLLKIFSNFMAVIGNASKYNPCSNYLDEPQVPEELTKLDE
    MENIFNLFIKFFTTILEFIGTVAGDSVCASYFDEPEVPEELTKLYE
    MENIFNLFIKFFTTILEFIGTVAGDSVCASYFDEPEVPEELTKLYE
    MNLLGGLLLKIFSNFMAVIGNASKYNPCSNYLDEPQVPEELTKLDE
    MNLLGGLLLKIFSNFMAVIGNASKYNPCSNYLDEPQVPEELTKLDE
    
    MENIFNLFIKFFTTILEFIGTVAGDSVCASYFDEPEVPEELTKLYE
    MENIFNLFIKFFTTILEFIGTVAGDSVCASYFDEPEVPEELTKLYE
    
    #* The agr typing is not defined, as I have compared the sequence with the amino acid sequences of AgrD described in the paper available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4187671/. It does not correspond to Type I, Type II, or Type III. (For more details, see below).
    
    -- AgrD I --
    Query  1       MENIFNLFIKFFTTILEFIGTVAGDSVCASYFDEPEVPEELTKLYE  46
            M  +  L +K F+  +  IG  +  + C  Y DEP+VPEELTKL E
    Sbjct  926825  MNLLGGLLLKIFSNFMAVIGNASKYNPCVMYLDEPQVPEELTKLDE  926688
    
    -- AgrD II --
    Query  1       MNLLGGLLLKIFSNFMAVIGNASKYNPCSNYLDEPQVPEELTKLDE  46
            MNLLGGLLLKIFSNFMAVIGNASKYNPC  YLDEPQVPEELTKLDE
    Sbjct  926825  MNLLGGLLLKIFSNFMAVIGNASKYNPCVMYLDEPQVPEELTKLDE  926688
    
    -- AgrD III --
    Query  1       MNLLGGLLLKLFSNFMAVIGNAAKYNPCASYLDEPQVPEELTKLDE  46
            MNLLGGLLLK+FSNFMAVIGNA+KYNPC  YLDEPQVPEELTKLDE
    Sbjct  926825  MNLLGGLLLKIFSNFMAVIGNASKYNPCVMYLDEPQVPEELTKLDE  926688
  6. (1st optional for mixed) Prepare the tree using roary from the mixed clinical (fastq.gz) and database (fasta) genomes —-roary—-> core_gene_alignement.aln —-iqtree2—-> core_gene_alignment.aln.treefile

    python ~/Scripts/run_mlst.py complete_genome_1282_ncbi.fasta sepidermidis  #73
    python ~/Scripts/run_mlst.py complete_genome_1282_ena.fasta sepidermidis   #73
    python ~/Scripts/run_mlst.py complete_genome_1282_ena_taxon_descendants.fasta sepidermidis
    grep -P "\tsepidermidis\t2\t" all.mlst > ST2.mlst  #73
    grep -v "NZ_" ST2.mlst > ST2_.mlst
    cut -d'.' -f1 ST2_.mlst | sort > ST2.ids
    cut -d'|' -f2 ST2.mlst | sort > ST2.ids
    diff ST2.ids ../ena_descendants_all_ST2_mlst/ST2.ids
    
    for sample in AP029227 CP013943 CP040883 CP045648 CP052939 CP052940 CP052955 CP052959 CP052981 CP052984 CP052990 CP052994 CP053088 CP064461 CP064467 CP064473 CP064476 CP064480 CP064485 CP064499 CP064508 CP064516 CP064525 CP064554 CP064557 CP064560 CP064572 CP064574 CP064587 CP064590 CP064599 CP064603 CP064606 CP064609 CP064613 CP064616 CP064619 CP064624 CP064631 CP064637 CP064650 CP073821 CP073824 CP073827 CP073830 CP073835 CP073836 CP073840 CP073841 CP073844 CP073847 CP073850 CP073852 CP073855 CP073857 CP073859 CP073862 CP073863 CP073878 CP073900 CP073904 CP084008 CP093173 CP093179 CP093181 CP093193 CP093196 CP093198 CP093212 CP097512 CP097514 CP120425 CP133663; do
        cp ../genomes_by_taxid/ncbi_all_ST2_mlst/${sample}.1.fasta ./
    done
    cd typing_complete.csv
    grep -P "\tsepidermidis\t2\t" typing_complete.csv > typing_ST2.csv
    
    echo -e "AP029227\nCP013943\nCP040883\nCP045648\nCP052939\nCP052940\nCP052955\nCP052959\nCP052981\nCP052984\nCP052990\nCP052994\nCP053088\nCP064461\nCP064467\nCP064473\nCP064476\nCP064480\nCP064485\nCP064499\nCP064508\nCP064516\nCP064525\nCP064554\nCP064557\nCP064560\nCP064572\nCP064574\nCP064587\nCP064590\nCP064599\nCP064603\nCP064606\nCP064609\nCP064613\nCP064616\nCP064619\nCP064624\nCP064631\nCP064637\nCP064650\nCP073821\nCP073824\nCP073827\nCP073830\nCP073835\nCP073836\nCP073840\nCP073841\nCP073844\nCP073847\nCP073850\nCP073852\nCP073855\nCP073857\nCP073859\nCP073862\nCP073863\nCP073878\nCP073900\nCP073904\nCP084008\nCP093173\nCP093179\nCP093181\nCP093193\nCP093196\nCP093198\nCP093212\nCP097512\nCP097514\nCP120425\nCP133663" > ids.txt
    cat ids.txt | while read id; do
        efetch -db nuccore -id $id -format gff3 > "${id}.gff"
        efetch -db nuccore -id $id -format fasta > "${id}.fasta"
        # Append the FASTA sequence to the GFF3 file with ##FASTA header
        echo "##FASTA" >> "${id}.gff"
        cat "${id}.fasta" >> "${id}.gff"
        rm "${id}.fasta"  # Optionally remove the separate FASTA file
    done
    for sample in mibi1440 mibi1454 mibi1456 mibi1459 mibi1461 mibi1478 mibi1480 mibi1485 mibi1487 mibi1496 mibi1498 mibi1502 mibi2318 mibi2319; do
        cp ../prokka/${sample}/${sample}.gff ./
    done
    
    #run Roary:
    roary -f roary_ST2 -e --mafft -p 100 roary_input_ST2/*.gff
    #roary -p 15 -f ./roary -i 95 -cd 99 -s -e -n -v roary_.fa.gb.gff CP012351.gff3 CP003084.gff3 CP023676.gff3
    #-i        minimum percentage identity for blastp [95]
    #-cd FLOAT percentage of isolates a gene must be in to be core [99]
    #-f STR    output directory [.]
    #-e        create a multiFASTA alignment of core genes using PRANK
    #-n (==--mafft)       fast core gene alignment with MAFFT, use with -e
    #-s        dont split paralogs
    #-v        verbose output to STDOUT
    
    #The file core_gene_alignment.aln can then be used for tree construction:
    iqtree2 -s core_gene_alignment.aln -m MFP -bb 1000 -nt AUTO
  7. (2nd tree-option for purely clinical samples) Prepare the tree under bengal3_ac3 generating phylogeny_fasttree or phylogeny_raxml using snakemake for all clinical samples

    #http://xgenes.com/article/article-content/372/identify-all-occurrences-of-phages-mt880870-mt880871-and-mt880872-in-s-epidermidis-st2-genomes-from-public-and-clinical-isolates/
    bacto-0.1.json
            "fastqc": false,
            "taxonomic_classifier": false,
            "assembly": true,
            "typing_ariba": false,
            "typing_mlst": true,
            "pangenome": true,
            "variants_calling": true,
            "phylogeny_fasttree": true,
            "phylogeny_raxml": true,
            "recombination": true,
    
    jhuang@WS-2290C:~/DATA/Data_Patricia_Sepi_7samples$ /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
  8. Preparing gene seqences (The following steps for calulating the presence-absence-matrix for predefined gene list)

    (Optional online search) https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&BLAST_SPEC=MicrobialGenomes
    #Title:Refseq prokaryote representative genomes (contains refseq assembly)
    #Molecule Type:mixed DNA
    #Update date:2024/10/16
    #Number of sequences:1038672
    
    # SCCmec,agr.typing,
    # gyrB (House keeper),
    # fumC,gltA,icd (Metabolic genes),
    # apsS,sigB,sarA,agrC,yycG (Virulence regulators),
    # psm.β (Toxins),    #psm.β1 --> psm.β and delete hlb
    # atlE,sdrG,sdrH,ebh,ebpS,tagB (Biofilm formation),    #ebp-->ebpS
    # pgsC,sepA,dltA,fmtC,lipA,sceD,SE0760 (Immune evasion & colonization),    #capC-->pgsC
    # esp,ecpA (Serine protease)
    
    #under ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates
    #samtools faidx 20250204_Gene_sequences.fasta gltA > gltA_gold.fasta
    #samtools faidx 20250204_Gene_sequences.fasta Psm > psm.β_gold.fasta
    #samtools faidx 20250204_Gene_sequences.fasta ebpS > ebpS_gold.fasta
    #samtools faidx 20250204_Gene_sequences.fasta tagB > tagB_gold.fasta
    #samtools faidx 20250204_Gene_sequences.fasta PgsC > pgsC_gold.fasta
    #samtools faidx 20250204_Gene_sequences.fasta sepA > sepA_gold.fasta
    #samtools faidx 20250204_Gene_sequences.fasta fmtC > fmtC_gold.fasta
    #samtools faidx 20250204_Gene_sequences.fasta sceD > sceD_gold.fasta
    #samtools faidx 20250204_Gene_sequences.fasta esp > esp_gold.fasta
    #samtools faidx 20250204_Gene_sequences.fasta ecpA > ecpA_gold.fasta
    
    cd ~/DATA/Data_Patricia_Sepi_7samples/
    mkdir presence_absence
    cut -d$'\t' -f1-12 typing.csv > typing_f12.csv
    cp typing_f12.csv presence_absence/typing.csv
    
    cd presence_absence
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/gyrB_revcomp.fasta gyrB.fasta
    
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/fumC_revcomp.fasta fumC.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/gltA_gold.fasta gltA.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/icd_revcomp.fasta icd.fasta
    
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/apsS.fasta .
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sigB_revcomp.fasta sigB.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sarA_revcomp.fasta sarA.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/agrC.fasta .
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/yycG.fasta .
    
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/psm.β_gold.fasta psm.β.fasta
    #cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/hlb.fasta .
    
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/atlE.fasta .
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sdrG.fasta .
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sdrH_revcomp.fasta sdrH.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/ebh_revcomp.fasta ebh.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/ebpS_gold.fasta ebpS.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/tagB_gold.fasta tagB.fasta
    
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/pgsC_gold.fasta pgsC.fasta #early is capC
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sepA_gold.fasta sepA.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/dltA.fasta .
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/fmtC_gold.fasta fmtC.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/lipA.fasta .
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sceD_gold.fasta sceD.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/SE0760.fasta .
    
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/esp_gold.fasta esp.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/ecpA_gold.fasta ecpA.fasta
  9. Perform blastn searching for short-sequencing

    # -- makeblastdb --
    for sample in HDM1 HDM7 HDM10 HDM11-SF1 HDM15-SF2 HDM2-SF1 HDM20-SF3; do
            makeblastdb -in ~/DATA/Data_Patricia_Sepi_7samples/shovill/${sample}/contigs.fa -dbtype nucl
    done
    
    # SCCmec,agr.typing,
    # gyrB(House keeper),
    # fumC,gltA,icd(Metabolic genes),
    # apsS,sigB,sarA,agrC,yycG(Virulence regulators),
    # psm.β (Toxins)    #psm.β1 --> psm.β and delete hlb
    # atlE,sdrG,sdrH,ebh,ebpS,tagB(Biofilm formation),    #ebp-->ebpS
    # pgsC,sepA,dltA,fmtC,lipA,sceD,SE0760(Immune evasion & colonization),    #capC-->pgsC
    # esp,ecpA(Serine protease)
    
    #Note: The current -evalue is set to 1e-1; it might need to be made more stringent.
    for id in gyrB    fumC gltA icd    apsS sigB sarA agrC yycG    psm.β    atlE sdrG sdrH ebh ebpS tagB    pgsC sepA dltA fmtC lipA sceD SE0760    esp ecpA; do
            echo "mkdir ${id}"
            echo "for sample in HDM1 HDM7 HDM10 HDM11-SF1 HDM15-SF2 HDM2-SF1 HDM20-SF3; do"
                    echo "blastn -db ~/DATA/Data_Patricia_Sepi_7samples/shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
            echo "done"
    done
    
    python ~/Scripts/process_directory.py gyrB typing.csv typing_until_gyrB.csv
    python ~/Scripts/process_directory.py fumC typing_until_gyrB.csv typing_until_fumC.csv
    python ~/Scripts/process_directory.py gltA typing_until_fumC.csv typing_until_gltA.csv
    python ~/Scripts/process_directory.py icd typing_until_gltA.csv typing_until_icd.csv
    
    python ~/Scripts/process_directory.py apsS typing_until_icd.csv typing_until_apsS.csv
    python ~/Scripts/process_directory.py sigB typing_until_apsS.csv typing_until_sigB.csv
    python ~/Scripts/process_directory.py sarA typing_until_sigB.csv typing_until_sarA.csv
    python ~/Scripts/process_directory.py agrC typing_until_sarA.csv typing_until_agrC.csv
    python ~/Scripts/process_directory.py yycG typing_until_agrC.csv typing_until_yycG.csv
    python ~/Scripts/process_directory.py psm.β typing_until_yycG.csv typing_until_psm.β.csv
    #python ~/Scripts/process_directory.py hlb typing_until_psm.β.csv typing_until_hlb.csv
    python ~/Scripts/process_directory.py atlE typing_until_psm.β.csv typing_until_atlE.csv
    python ~/Scripts/process_directory.py sdrG typing_until_atlE.csv typing_until_sdrG.csv
    python ~/Scripts/process_directory.py sdrH typing_until_sdrG.csv typing_until_sdrH.csv
    
    python ~/Scripts/process_directory.py ebh typing_until_sdrH.csv typing_until_ebh.csv
    python ~/Scripts/process_directory.py ebpS typing_until_ebh.csv typing_until_ebpS.csv
    python ~/Scripts/process_directory.py tagB typing_until_ebpS.csv typing_until_tagB.csv
    python ~/Scripts/process_directory.py pgsC typing_until_tagB.csv typing_until_pgsC.csv
    python ~/Scripts/process_directory.py sepA typing_until_pgsC.csv typing_until_sepA.csv
    python ~/Scripts/process_directory.py dltA typing_until_sepA.csv typing_until_dltA.csv
    python ~/Scripts/process_directory.py fmtC typing_until_dltA.csv typing_until_fmtC.csv
    python ~/Scripts/process_directory.py lipA typing_until_fmtC.csv typing_until_lipA.csv
    
    python ~/Scripts/process_directory.py sceD typing_until_lipA.csv typing_until_sceD.csv
    python ~/Scripts/process_directory.py SE0760 typing_until_sceD.csv typing_until_SE0760.csv
    python ~/Scripts/process_directory.py esp typing_until_SE0760.csv typing_until_esp.csv
    python ~/Scripts/process_directory.py ecpA typing_until_esp.csv typing_until_ecpA.csv
  10. Identify all occurrences of Phages MT880870, MT880871 and MT880872 in S. epidermidis genomes

    # http://xgenes.com/article/article-content/371/identify-all-occurrences-of-phage-hh1-mt880870-in-s-epidermidis-st2-genomes-from-public-and-clinical-isolates/
    # http://xgenes.com/article/article-content/372/identify-all-occurrences-of-phages-mt880870-mt880871-and-mt880872-in-s-epidermidis-st2-genomes-from-public-and-clinical-isolates/
    
    cd presence_absence
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/MT880870.fasta .
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/MT880871.fasta .
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/MT880872.fasta .
    
    samtools faidx MT880870.fasta
    samtools faidx MT880871.fasta
    samtools faidx MT880872.fasta
    for i in {1..51}; do
        num=$(printf "%03d" $((571 + i)))  # Generates 572 to 622
        id="QPB07${num}"
        samtools faidx MT880870.fasta "lcl|MT880870.1_cds_${id}.1_${i}" > ${id}.fasta
    done
    for i in {1..45}; do
        num=$((622 + i))  # Generates 623 to 667
        id="QPB07${num}"
        samtools faidx MT880871.fasta "lcl|MT880871.1_cds_${id}.1_${i}" > ${id}.fasta
    done
    for i in {1..170}; do
        num=$((7667 + i))  # Generates 7668 to 7837
        id="QPB0$num"
        samtools faidx MT880872.fasta "lcl|MT880872.1_cds_${id}.1_${i}" > ${id}.fasta
    done
    
    for i in {572..622}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in HDM1 HDM7 HDM10 HDM11-SF1 HDM15-SF2 HDM2-SF1 HDM20-SF3; do"
        echo "blastn -db ~/DATA/Data_Patricia_Sepi_7samples/shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-10 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    for i in {623..667}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in HDM1 HDM7 HDM10 HDM11-SF1 HDM15-SF2 HDM2-SF1 HDM20-SF3; do"
        echo "blastn -db ~/DATA/Data_Patricia_Sepi_7samples/shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-10 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    for i in {668..837}; do
        id=$(printf "QPB07%03d" "$i")
        echo "mkdir ${id}"
        echo "for sample in HDM1 HDM7 HDM10 HDM11-SF1 HDM15-SF2 HDM2-SF1 HDM20-SF3; do"
        echo "blastn -db ~/DATA/Data_Patricia_Sepi_7samples/shovill/\${sample}/contigs.fa -query ${id}.fasta -evalue 1e-10 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
        echo "done"
    done
    
    # -- under under the presence_absence DIR --
    # - for generating a big Excel-table for showing all presence-absence info -
    for i in {572..837}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "python ~/Scripts/process_directory.py ${id} typing_until_${id_1}.csv typing_until_${id}.csv"
    done
    #Adapted the first record as: python ~/Scripts/process_directory.py QPB07572 typing_until_ecpA.csv typing_until_QPB07572.csv
    #...
    #cp typing_until_QPB07837.csv typing_all.csv  #Then save typing_all.csv as Excel table typing_all.xlsx!
    
    # - for drawing seperate plots for phages and selected genes -
    for i in {572..622}; do
        #echo "python ~/Scripts/process_directory.py ${id} typing_until_${id_1}.csv typing_until_${id}.csv"
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "python ~/Scripts/process_directory.py ${id} typing_until_${id_1}.csv typing_until_${id}.csv"
    done
    #Adapted the first record as: python ~/Scripts/process_directory.py QPB07572 typing.csv typing_until_QPB07572.csv
    #reprace all '+' with 'MT880870' in typing_until_QPB07622.csv
    sed -i 's/+/MT880870/g' typing_until_QPB07622.csv
    
    for i in {623..667}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "python ~/Scripts/process_directory.py ${id} typing_until_${id_1}.csv typing_until_${id}.csv"
    done
    sed -i 's/+/MT880871/g' typing_until_QPB07667.csv
    
    for i in {668..837}; do
        id_1=$(printf "QPB07%03d" "$((i-1))")
        id=$(printf "QPB07%03d" "$i")
        echo "python ~/Scripts/process_directory.py ${id} typing_until_${id_1}.csv typing_until_${id}.csv"
    done
    sed -i 's/+/MT880872/g' typing_until_QPB07837.csv
  11. plotTreeHeatmap (draw plot for phages)

    library(ggtree)
    library(ggplot2)
    library(dplyr)
    setwd("~/DATA/Data_Patricia_Sepi_7samples/presence_absence/")
    
    info <- read.csv("typing_until_QPB07837.csv", sep="\t")
    # Convert 'ST' column to factor with levels in the desired order
    info$ST <- factor(info$ST, levels = c("5", "59", "87", "224", "-"))
    info$name <- info$Isolate
    # Prepare the snippy.core.aln.raxml.tree by deleting the reference from snippy.core.aln.raxml.bestTree
    tree <- read.tree("~/DATA/Data_Patricia_Sepi_7samples/raxml-ng/snippy.core.aln.raxml.tree")
    #cols <- c("Public database"='purple2', "Clinical isolate"='darksalmon')  #purple2  skyblue2
    cols <- c("5"="darkred", "59"="cornflowerblue", "87"="orange", "224"="purple2", "-"="grey")
    heatmapData2 <- info %>% dplyr::select(Isolate, QPB07572, QPB07573, QPB07574, QPB07575, QPB07576, QPB07577, QPB07578, QPB07579, QPB07580, QPB07581, QPB07582, QPB07583, QPB07584, QPB07585, QPB07586, QPB07587, QPB07588, QPB07589, QPB07590, QPB07591, QPB07592, QPB07593, QPB07594, QPB07595, QPB07596, QPB07597, QPB07598, QPB07599, QPB07600, QPB07601, QPB07602, QPB07603, QPB07604, QPB07605, QPB07606, QPB07607, QPB07608, QPB07609, QPB07610, QPB07611, QPB07612, QPB07613, QPB07614, QPB07615, QPB07616, QPB07617, QPB07618, QPB07619, QPB07620, QPB07621, QPB07622, QPB07623, QPB07624, QPB07625, QPB07626, QPB07627, QPB07628, QPB07629, QPB07630, QPB07631, QPB07632, QPB07633, QPB07634, QPB07635, QPB07636, QPB07637, QPB07638, QPB07639, QPB07640, QPB07641, QPB07642, QPB07643, QPB07644, QPB07645, QPB07646, QPB07647, QPB07648, QPB07649, QPB07650, QPB07651, QPB07652, QPB07653, QPB07654, QPB07655, QPB07656, QPB07657, QPB07658, QPB07659, QPB07660, QPB07661, QPB07662, QPB07663, QPB07664, QPB07665, QPB07666, QPB07667, QPB07668, QPB07669, QPB07670, QPB07671, QPB07672, QPB07673, QPB07674, QPB07675, QPB07676, QPB07677, QPB07678, QPB07679, QPB07680, QPB07681, QPB07682, QPB07683, QPB07684, QPB07685, QPB07686, QPB07687, QPB07688, QPB07689, QPB07690, QPB07691, QPB07692, QPB07693, QPB07694, QPB07695, QPB07696, QPB07697, QPB07698, QPB07699, QPB07700, QPB07701, QPB07702, QPB07703, QPB07704, QPB07705, QPB07706, QPB07707, QPB07708, QPB07709, QPB07710, QPB07711, QPB07712, QPB07713, QPB07714, QPB07715, QPB07716, QPB07717, QPB07718, QPB07719, QPB07720, QPB07721, QPB07722, QPB07723, QPB07724, QPB07725, QPB07726, QPB07727, QPB07728, QPB07729, QPB07730, QPB07731, QPB07732, QPB07733, QPB07734, QPB07735, QPB07736, QPB07737, QPB07738, QPB07739, QPB07740, QPB07741, QPB07742, QPB07743, QPB07744, QPB07745, QPB07746, QPB07747, QPB07748, QPB07749, QPB07750, QPB07751, QPB07752, QPB07753, QPB07754, QPB07755, QPB07756, QPB07757, QPB07758, QPB07759, QPB07760, QPB07761, QPB07762, QPB07763, QPB07764, QPB07765, QPB07766, QPB07767, QPB07768, QPB07769, QPB07770, QPB07771, QPB07772, QPB07773, QPB07774, QPB07775, QPB07776, QPB07777, QPB07778, QPB07779, QPB07780, QPB07781, QPB07782, QPB07783, QPB07784, QPB07785, QPB07786, QPB07787, QPB07788, QPB07789, QPB07790, QPB07791, QPB07792, QPB07793, QPB07794, QPB07795, QPB07796, QPB07797, QPB07798, QPB07799, QPB07800, QPB07801, QPB07802, QPB07803, QPB07804, QPB07805, QPB07806, QPB07807, QPB07808, QPB07809, QPB07810, QPB07811, QPB07812, QPB07813, QPB07814, QPB07815, QPB07816, QPB07817, QPB07818, QPB07819, QPB07820, QPB07821, QPB07822, QPB07823, QPB07824, QPB07825, QPB07826, QPB07827, QPB07828, QPB07829, QPB07830, QPB07831, QPB07832, QPB07833, QPB07834, QPB07835, QPB07836, QPB07837)  #ST,
    
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
    heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
    names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")
    
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    #circular
    #geom_tippoint(aes(color=Source)) + scale_color_manual(values=cols)
    p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=ST), size=4) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    png("ggtree_phages.png", width=1260, height=1260)
    #svg("ggtree_phages.svg", width=1260, height=1260)
    p
    dev.off()
    
    #png("ggtree_and_gheatmap_phages.png", width=1290, height=1000)
    #svg("ggtree_and_gheatmap_phages.svg", width=17, height=15)
    #gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    
    png("ggtree_and_gheatmap_phages.png", width=1290, height=1000)
    gheatmap(p, heatmapData2, width=0.5, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.0, hjust=0.5, font.size=0, offset = 3) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    dev.off()
  12. plotTreeHeatmap (draw plot for selected genes)

    #FastTree -gtr -nt variants/snippy.core_without_reference.aln > plotTreeHeatmap/snippy.core.tree
    
    #cp ../Data_Holger_S.epidermidis_short/plotTreeHeatmap/typing_104.csv .
    #cp ../Data_Holger_S.epidermidis_short/results_HDRNA_01-20/variants/snippy.core.aln_104.tree .
    #Run step 4
    
    library(ggtree)
    library(ggplot2)
    library(dplyr)
    setwd("~/DATA/Data_Patricia_Sepi_7samples/presence_absence/")
    
    # -- edit tree --
    info <- read.csv("typing_until_ecpA.csv", sep="\t")
    # Convert 'ST' column to factor with levels in the desired order
    info$ST <- factor(info$ST, levels = c("5", "59", "87", "224", "-"))
    
    info$name <- info$Isolate
    tree <- read.tree("~/DATA/Data_Patricia_Sepi_7samples/raxml-ng/snippy.core.aln.raxml.tree")
    cols <- c("5"="darkred", "59"="cornflowerblue", "87"="orange", "224"="purple2", "-"="grey")
    
    heatmapData2 <- info %>% dplyr::select(Isolate,          gyrB,    fumC, gltA, icd,    apsS, sigB, sarA, agrC, yycG,    psm.β,    atlE, sdrG, sdrH, ebh, ebpS, tagB,    pgsC, sepA, dltA, fmtC, lipA, sceD, SE0760,    esp, ecpA)  #ST,
    rn <- heatmapData2$Isolate
    heatmapData2$Isolate <- NULL
    heatmapData2 <- as.data.frame(sapply(heatmapData2, as.character))
    rownames(heatmapData2) <- rn
    
    heatmap.colours <- c("darkgreen", "grey")
    names(heatmap.colours) <- c("+","-")
    
    #heatmap.colours <- c("darkred", "darkblue", "darkgreen", "grey")
    #names(heatmap.colours) <- c("MT880870","MT880871","MT880872","-")
    #heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "gold",     "green","orange","pink","purple","magenta","brown", "darksalmon","chocolate4","darkkhaki", "azure3", "maroon","lightgreen",     "blue","cyan", "skyblue2", "blueviolet",       "darkred", "darkblue", "darkgreen", "grey")
    #names(heatmap.colours) <- c("2","5","7","9","14", "17","23",   "35","59","73", "81","86","87","89","130","190","290", "297","325",    "454","487","558","766",       "MT880870","MT880871","MT880872","-")
    #TEMP_DEACTIVATED!  heatmap.colours <- c("cornflowerblue","darkgreen","seagreen3","tan","red",  "navyblue", "purple",     "green","cyan",       "darkred", "darkblue", "darkgreen", "grey",  "darkgreen", "grey")
    #TEMP_DEACTIVATED!  names(heatmap.colours) <- c("SCCmec_type_II(2A)", "SCCmec_type_III(3A)", "SCCmec_type_III(3A) and SCCmec_type_VIII(4A)", "SCCmec_type_IV(2B)", "SCCmec_type_IV(2B&5)", "SCCmec_type_IV(2B) and SCCmec_type_VI(4B)", "SCCmec_type_IVa(2B)", "SCCmec_type_IVb(2B)", "SCCmec_type_IVg(2B)",  "I", "II", "III", "none", "+","-")
    ## The heatmap colours should correspond to the factor levels
    #heatmap.colours <- #setNames(c("cornflowerblue","darkgreen","seagreen3","tan","red","green","orange","pink","brown","magenta","cyan",
    #"magenta","navyblue","cornflowerblue","gold","orange","darkgreen", "green", "seagreen3", #"chocolate4","brown","purple","pink", "tan","cyan","red"),
    #c("5", "23", "35", "69", "86", "87", "130", "224", "487", "640", "none", "P01", "P02", "P03", "P04", "P05", "P06", "P07", #"P08", "P10", "P11", "P12", "P16", "P17", "P19", "P20"))
    
    #mydat$Regulation <- factor(mydat$Regulation, levels=c("up","down"))
    #circular
    #p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=ST)) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    p <- ggtree(tree, layout='circular', branch.length='none') %<+% info + geom_tippoint(aes(color=ST), size=4) + scale_color_manual(values=cols) + geom_tiplab2(aes(label=name), offset=1)
    png("ggtree_selected_genes.png", width=1260, height=1260)
    #svg("ggtree_selected_genes.svg", width=1260, height=1260)
    p
    dev.off()
    #gheatmap(p, heatmapData2, width=0.1, colnames_position="top", colnames_angle=90, colnames_offset_y = 0.1, hjust=0.5, font.size=4, offset = 5) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 14)) + theme(legend.title = element_text(size = 14)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    #svg("ggtree_and_gheatmap_mibi_selected_genes.svg", width=17, height=15)
    
    png("ggtree_and_gheatmap_selected_genes.png", width=1690, height=1400)
    gheatmap(p, heatmapData2, width=4, colnames_position="top", colnames_angle=45, colnames_offset_y = 0.0, hjust=0.5, font.size=5.8, offset = 3.8) + scale_fill_manual(values=heatmap.colours) +  theme(legend.text = element_text(size = 16)) + theme(legend.title = element_text(size = 16)) + guides(fill=guide_legend(title=""), color = guide_legend(override.aes = list(size = 5)))
    dev.off()
  13. (optional) Prepare the long fasta if we use only the long sequencing

    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_01_K01_conservative.fasta HDRNA_01_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_03_K01_bold_bandage.fasta HDRNA_03_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_06_K01_conservative.fasta HDRNA_06_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_07_K01_conservative.fasta HDRNA_07_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_08_K01_conservative.fasta HDRNA_08_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_12_K01_bold.fasta HDRNA_12_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_16_K01_conservative.fasta HDRNA_16_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_17_K01_conservative.fasta HDRNA_17_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_19_K01_bold.fasta HDRNA_19_K01.fasta
    ln -s ../Data_Holger_S.epidermidis_long/finished_assemblies/HDRNA_20_K01_conservative.fasta HDRNA_20_K01.fasta
  14. (optional) Perform blastn searching for long-sequencing

    #under ~/DATA/Data_PaulBongarts_S.epidermidis_HDRNA/plotTreeHeatmap_long
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/gltA_gold.fasta gltA.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/psm.β_gold.fasta psm.β.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/ebpS_gold.fasta ebpS.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/tagB_gold.fasta tagB.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/pgsC_gold.fasta pgsC.fasta #early is capC
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sepA_gold.fasta sepA.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/fmtC_gold.fasta fmtC.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/sceD_gold.fasta sceD.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/esp_gold.fasta esp.fasta
    cp ~/DATA/Data_Luise_Sepi_STKN/plotTreeHeatmap_mibi_isolates/ecpA_gold.fasta ecpA.fasta
    
    # -- makeblastdb --
    for sample in HDRNA_01_K01 HDRNA_03_K01 HDRNA_06_K01 HDRNA_07_K01 HDRNA_08_K01 HDRNA_12_K01 HDRNA_16_K01 HDRNA_17_K01 HDRNA_19_K01 HDRNA_20_K01; do
            makeblastdb -in ../assembly/${sample}.fasta -dbtype nucl
    done
    
    for id in gyrB    fumC gltA icd    apsS sigB sarA agrC yycG    psm.β hlb    atlE sdrG sdrH ebh ebpS tagB    pgsC sepA dltA fmtC lipA sceD SE0760    esp ecpA; do
    #for id in gltA psm.β ebpS tagB pgsC sepA fmtC sceD esp ecpA; do
    echo "mkdir ${id}"
    echo "for sample in in HDRNA_01_K01 HDRNA_03_K01 HDRNA_06_K01 HDRNA_07_K01 HDRNA_08_K01 HDRNA_12_K01 HDRNA_16_K01 HDRNA_17_K01 HDRNA_19_K01 HDRNA_20_K01; do"
            echo "blastn -db ../assembly/${sample}.fasta -query ${id}.fasta -evalue 1e-40 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 > ./${id}/\${sample}.blastn"
    echo "done"
    done
    
    python ~/Scripts/process_directory.py gyrB typing.csv typing_until_gyrB.csv
    python ~/Scripts/process_directory.py fumC typing_until_gyrB.csv typing_until_fumC.csv
    python ~/Scripts/process_directory.py gltA typing_until_fumC.csv typing_until_gltA.csv
    python ~/Scripts/process_directory.py icd typing_until_gltA.csv typing_until_icd.csv
    
    python ~/Scripts/process_directory.py apsS typing_until_icd.csv typing_until_apsS.csv
    python ~/Scripts/process_directory.py sigB typing_until_apsS.csv typing_until_sigB.csv
    python ~/Scripts/process_directory.py sarA typing_until_sigB.csv typing_until_sarA.csv
    python ~/Scripts/process_directory.py agrC typing_until_sarA.csv typing_until_agrC.csv
    python ~/Scripts/process_directory.py yycG typing_until_agrC.csv typing_until_yycG.csv
    python ~/Scripts/process_directory.py psm.β typing_until_yycG.csv typing_until_psm.β.csv
    python ~/Scripts/process_directory.py hlb typing_until_psm.β1.csv typing_until_hlb.csv
    python ~/Scripts/process_directory.py atlE typing_until_hlb.csv typing_until_atlE.csv
    python ~/Scripts/process_directory.py sdrG typing_until_atlE.csv typing_until_sdrG.csv
    python ~/Scripts/process_directory.py sdrH typing_until_sdrG.csv typing_until_sdrH.csv
    
    python ~/Scripts/process_directory.py ebh typing_until_sdrH.csv typing_until_ebh.csv
    python ~/Scripts/process_directory.py ebpS typing_until_ebh.csv typing_until_ebpS.csv
    python ~/Scripts/process_directory.py tagB typing_until_ebp.csv typing_until_tagB.csv
    python ~/Scripts/process_directory.py pgsC typing_until_tagB.csv typing_until_pgsC.csv
    python ~/Scripts/process_directory.py sepA typing_until_capC.csv typing_until_sepA.csv
    python ~/Scripts/process_directory.py dltA typing_until_sepA.csv typing_until_dltA.csv
    python ~/Scripts/process_directory.py fmtC typing_until_dltA.csv typing_until_fmtC.csv
    python ~/Scripts/process_directory.py lipA typing_until_fmtC.csv typing_until_lipA.csv
    
    python ~/Scripts/process_directory.py sceD typing_until_lipA.csv typing_until_sceD.csv
    python ~/Scripts/process_directory.py SE0760 typing_until_sceD.csv typing_until_SE0760.csv
    python ~/Scripts/process_directory.py esp typing_until_SE0760.csv typing_until_esp.csv
    python ~/Scripts/process_directory.py ecpA typing_until_esp.csv typing_until_ecpA.csv
  15. Report

    Attached is the presence-absence table (typing_all.xlsx), which includes the presence and absence status of all selected genes (listed below), as well as all genes from the three phages MT880870, MT880871, and MT880872.

    You may notice some differences in the results for the selected genes compared to the ones I sent earlier. This is because Holger provided updated sequences, correcting those I previously used for the selected genes.

    Additionally, I’ve included two graphics visualizing the presence-absence table:

    • Selected genes: ggtree_and_gheatmap_selected_genes.png
    • Phage genes (MT880870, MT880871, and MT880872): ggtree_and_gheatmap_phages.png

    Here’s the list of selected genes:

    • Housekeeping gene: gyrB
    • Metabolic genes: fumC, gltA, icd
    • Virulence regulators: apsS, sigB, sarA, agrC, yycG
    • Toxins: psmβ
    • Biofilm formation: atlE, sdrG, sdrH, ebh, ebpS, tagB
    • Immune evasion & colonization: pgsC, sepA, dltA, fmtC, lipA, sceD, SE0760
    • Serine protease: esp, ecpA

Processing for Data_Tam_DNAseq_2025

  1. Targets

    Could you please help me to process these data (Project: X101SC25015922-Z01-J001)?
    For you information,
    1. Please compare the data with the AYE strain (CU459141) across the following conditions:
    a) AYE-S
    b) AYE-Q
    c) AYE-WT on Tig4
    d) AYE-craA on Tig4
    e) AYE-craA-1 on Cm200
    f) AYE-craA-2 on Cm200
    2. The "clinical" sample refers to a clinical isolate of Acinetobacter baumannii. I’m unsure which reference genome would be most appropriate for comparison in this case. Can we use lab strains (CP059040, CU459141, and CP079931) as reference genome for comparison?
    
    Processed the genome sequence for project X101SC24115801-Z01-J001?
    1. Kindly compare the data with the ATCC 19606 strain (CP059040) under the following conditions:
    a) adeABadeIJ (knockout of adeA, adeB, adeI, and adeJ, please confirm whether these genes are successfully knocked out.)
    b) adeIJK (knockout of adeI, adeJ, and adeK, please confirm whether these genes are successfully knocked out.)
    c) CM1
    d) CM2
    The "HF" sample may also refer to a clinical isolate of Enterobacter hormaechei.
    2. The "HF" sample refers to a clinical isolate of Acinetobacter baumannii. I’m unsure which reference genome would be most appropriate for comparison in this case. Can we use lab strains (CP059040, CU459141, and CP079931) as reference genome for comparison?

Project Data_Tam_DNAseq_2025_AYE

  1. Download the raw data.

    86e4016c902a1cd23a2190415425e641  01.RawData/AYE-WTonTig4/AYE-WTonTig4_1.fq.gz
    554eb44ae261312039929f0991582111  01.RawData/AYE-WTonTig4/AYE-WTonTig4_2.fq.gz
    ce004b0d7135bce80f34bd6bac3e89e7  01.RawData/AYE-Q/AYE-Q_1.fq.gz
    bddc7ced051a2167a5a8341332d7423a  01.RawData/AYE-Q/AYE-Q_2.fq.gz
    227d93b8a762185d5dcd1e4975041491  01.RawData/AYE-S/AYE-S_1.fq.gz
    f098c9a8579bf5729427dc871225a290  01.RawData/AYE-S/AYE-S_2.fq.gz
    78e08dd090d89330b1021ce42fb09baa  01.RawData/clinical/clinical_1.fq.gz
    2346fef1d896ef0924d2ec88db51cade  01.RawData/clinical/clinical_2.fq.gz
    4c07494505caf22f70edb54692bcaca2  01.RawData/AYE-craA-1onCm200/AYE-craA-1onCm200_1.fq.gz
    52944e395004dc11758d422690bda168  01.RawData/AYE-craA-1onCm200/AYE-craA-1onCm200_2.fq.gz
    92b498ed7465645ca00bbc945c514fe2  01.RawData/AYE-craAonTig4/AYE-craAonTig4_1.fq.gz
    fd9d670942973e6760d6dd78f4ee852a  01.RawData/AYE-craAonTig4/AYE-craAonTig4_2.fq.gz
    375f1e3efb60571ffd457b3cb1e64a84  01.RawData/AYE-craA-2onCm200/AYE-craA-2onCm200_1.fq.gz
    041c08f4c45f1fabd129fc10500c6582  01.RawData/AYE-craA-2onCm200/AYE-craA-2onCm200_2.fq.gz
    c129aa9a208ca47db10bb04e54c096d7  02.Report_X101SC25015922-Z01-J001.zip
    
    md5sum 01.RawData/AYE-WTonTig4/AYE-WTonTig4_1.fq.gz > MD5.txt_
    md5sum 01.RawData/AYE-WTonTig4/AYE-WTonTig4_2.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-Q/AYE-Q_1.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-Q/AYE-Q_2.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-S/AYE-S_1.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-S/AYE-S_2.fq.gz >> MD5.txt_
    md5sum 01.RawData/clinical/clinical_1.fq.gz >> MD5.txt_
    md5sum 01.RawData/clinical/clinical_2.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-craA-1onCm200/AYE-craA-1onCm200_1.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-craA-1onCm200/AYE-craA-1onCm200_2.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-craAonTig4/AYE-craAonTig4_1.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-craAonTig4/AYE-craAonTig4_2.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-craA-2onCm200/AYE-craA-2onCm200_1.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-craA-2onCm200/AYE-craA-2onCm200_2.fq.gz >> MD5.txt_
    md5sum 02.Report_X101SC25015922-Z01-J001.zip >> MD5.txt_
    
    ce004b0d7135bce80f34bd6bac3e89e7  AYE-Q_1.fq.gz
    bddc7ced051a2167a5a8341332d7423a  AYE-Q_2.fq.gz

Data process according to http://xgenes.com/article/article-content/325/analysis-of-snps-indels-transposons-and-is-elements-in-5-a-baumannii-strains/

  1. Call variant calling using snippy

    ln -s ~/Tools/bacto/db/ .;
    ln -s ~/Tools/bacto/envs/ .;
    ln -s ~/Tools/bacto/local/ .;
    cp ~/Tools/bacto/Snakefile .;
    cp ~/Tools/bacto/bacto-0.1.json .;
    cp ~/Tools/bacto/cluster.json .;
    
    mkdir raw_data; cd raw_data;
    
    # Note that the names must be ending with fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-S/AYE-S_1.fq.gz AYE-S_R1.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-S/AYE-S_2.fq.gz AYE-S_R2.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-Q/AYE-Q_1.fq.gz AYE-Q_R1.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-Q/AYE-Q_2.fq.gz AYE-Q_R2.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-WTonTig4/AYE-WTonTig4_1.fq.gz AYE-WT_on_Tig4_R1.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-WTonTig4/AYE-WTonTig4_2.fq.gz AYE-WT_on_Tig4_R2.fastq.gz
    
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-craAonTig4/AYE-craAonTig4_1.fq.gz AYE-craA_on_Tig4_R1.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-craAonTig4/AYE-craAonTig4_2.fq.gz AYE-craA_on_Tig4_R2.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-craA-1onCm200/AYE-craA-1onCm200_1.fq.gz AYE-craA-1_on_Cm200_R1.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-craA-1onCm200/AYE-craA-1onCm200_2.fq.gz AYE-craA-1_on_Cm200_R2.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-craA-2onCm200/AYE-craA-2onCm200_1.fq.gz AYE-craA-2_on_Cm200_R1.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-craA-2onCm200/AYE-craA-2onCm200_2.fq.gz AYE-craA-2_on_Cm200_R2.fastq.gz
    
    #ln -s ../X101SC25015922-Z01-J001/01.RawData/clinical/clinical_1.fq.gz clinical_R1.fastq.gz
    #ln -s ../X101SC25015922-Z01-J001/01.RawData/clinical/clinical_2.fq.gz clinical_R2.fastq.gz
    
    #download CU459141.gb from GenBank
    mv ~/Downloads/sequence\(1\).gb db/CU459141.gb
    #setting the following in bacto-0.1.json
    
        "fastqc": false,
        "taxonomic_classifier": false,
        "assembly": true,
        "typing_ariba": false,
        "typing_mlst": true,
        "pangenome": true,
        "variants_calling": true,
        "phylogeny_fasttree": true,
        "phylogeny_raxml": true,
        "recombination": false, (due to gubbins-error set false)
    
        "genus": "Acinetobacter",
        "kingdom": "Bacteria",
        "species": "baumannii",  (in both prokka and mykrobe)
        "reference": "db/CU459141.gb"
    
    conda activate bengal3_ac3
    (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
    
    #check if we need big calculation for including the clinical sample by checking mlst. TODO: send the mlst results to Tam. Next step by check vrap which complete isolate?
  2. Run second run without the clinical sample

    mkdir results_with_clinical
    mv variants results_with_clinical
    mv roary results_with_clinical
    mv fasttree results_with_clinical
    mv raxml-ng results_with_clinical
    mv snippy/clinical/ snippy_clinical
    mv trimmed/clinical_trimmed_*.fastq .
    rm raw_data/clinical_*.fastq.gz
    rm fastq/clinical_*.fastq
    
    (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
  3. Using spandx calling variants (almost the same results to the one from viral-ngs!)

    mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/CP059040
    cp CP059040.gb  ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/CP059040/genes.gbk
    vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
    /home/jhuang/miniconda3/envs/spandx/bin/snpEff build CP059040    #-d
    ~/Scripts/genbank2fasta.py CP059040.gb
    mv CP059040.gb_converted.fna CP059040.fasta    #rename "CP059040.1 xxxxx" to "CP059040" in the fasta-file
    ln -s /home/jhuang/Tools/spandx/ spandx
    (spandx) nextflow run spandx/main.nf --fastq "snippy_CP059040/trimmed/*_P_{1,2}.fastq" --ref CP059040.fasta --annotation --database CP059040 -resume

Run vrap for calling the next closely species from the database for the clinical sample!

    ln -s ../X101SC24115801-Z01-J001/01.RawData/HF/HF_1.fq.gz HF_R1.fastq.gz
    ln -s ../X101SC24115801-Z01-J001/01.RawData/HF/HF_2.fq.gz HF_R2.fastq.gz
  1. Download all S epidermidis genomes and identified all ST2 isolates from them!

    #Acinetobacter baumannii Taxonomy ID: 470
    #esearch -db nucleotide -query "txid470[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_470_ncbi.fasta
    #python ~/Scripts/filter_fasta.py genome_470_ncbi.fasta complete_genome_470_ncbi.fasta  #
    
    # ---- Download related genomes from ENA ----
    https://www.ebi.ac.uk/ena/browser/view/470
    #Click "Sequence" and download "Counts" (13059) and "Taxon descendants count" (16091) if there is enough time! Downloading time points is 28.02.2025.
    python ~/Scripts/filter_fasta.py  ena_470_sequence.fasta complete_genome_470_ena_taxon_descendants_count.fasta  #16091-->920
    #python ~/Scripts/filter_fasta.py ena_470_sequence_Counts.fasta complete_genome_470_ena_Counts.fasta  #xxx, 5.8G
  2. Run vrap

    #replace --virus to the specific taxonomy (e.g. Acinetobacter baumannii) --> change virus_user_db --> specific_bacteria_user_db
    ln -s ~/Tools/vrap/ .
    mamba activate /home/jhuang/miniconda3/envs/vrap
    vrap/vrap.py  -1 trimmed/clinical/clinical_1.fq.gz -2 trimmed/clinical/clinical_2.fq.gz -o vrap_clinical --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/DATA/Data_Tam_DNAseq_2025_AYE/complete_genome_470_ena_taxon_descendants_count.fasta --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g

Project Data_Tam_DNAseq_2025_adeABadeIJ_adeIJK_CM1_CM2

    (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
    #HF is Enterobacter cloacae (550) or Enterobacter hormaechei (158836)

    # ---- Download related genomes from ENA ----
    https://www.ebi.ac.uk/ena/browser/view/550
    #Click "Sequence" and download "Counts" (7263) and "Taxon descendants count" (8004) if there is enough time! Downloading time points is 28.02.2025.
    python ~/Scripts/filter_fasta.py  ena_550_sequence.fasta complete_genome_550_ena_taxon_descendants_count.fasta  #8004-->100
    https://www.ebi.ac.uk/ena/browser/view/158836
    #Click "Sequence" and download "Counts" (3763) and "Taxon descendants count" (4846) if there is enough time! Downloading time points is 28.02.2025.
    python ~/Scripts/filter_fasta.py  ena_158836_sequence.fasta complete_genome_158836_ena_taxon_descendants_count.fasta  #4846-->540
    cat complete_genome_158836_ena_taxon_descendants_count.fasta complete_genome_550_ena_taxon_descendants_count.fasta > complete_genome_158836_550.fasta
    grep "ENA|AP022130|AP022130.1" complete_genome_158836_550.fasta
    #>ENA|AP022130|AP022130.1 Enterobacter cloacae plasmid pWP5-S18-CRE-02_4 DNA, complete genome, strain: WP5-S18-CRE-02.

    ln -s ~/Tools/vrap/ .
    mamba activate /home/jhuang/miniconda3/envs/vrap
    vrap/vrap.py  -1 trimmed/clinical/clinical_1.fq.gz -2 trimmed/clinical/clinical_2.fq.gz -o vrap_clinical --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/DATA/Data_Tam_DNAseq_2025_AYE/complete_genome_470_ena_taxon_descendants_count.fasta --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g

Supplementary: Enterobacter cloacae (taxid550) vs. Enterobacter hormaechei (taxid158836)

    🔬 介绍
    阴沟肠杆菌(Enterobacter cloacae) 和 霍尔马氏肠杆菌(Enterobacter hormaechei) 都属于 肠杆菌科(Enterobacteriaceae),是革兰氏阴性、兼性厌氧的杆状细菌。它们广泛存在于 环境中(如水、土壤、植物) 以及 人类和动物的肠道 中。
    🦠 Enterobacter cloacae(阴沟肠杆菌)

    ✅ 特征:

        革兰氏阴性、兼性厌氧、运动性杆菌
        能在多种环境中生存,适应性强
        具有 β-内酰胺酶,能抗多种抗生素

    ✅ 致病性:

        是一种 机会性感染菌,可导致 医院相关感染(HAI),如:
            尿路感染(UTI)
            肺炎
            败血症
            伤口感染

    ✅ 耐药性:

        产生 超广谱β-内酰胺酶(ESBLs) 或 碳青霉烯酶(CRE),对 青霉素、头孢菌素、碳青霉烯类 抗生素具有高耐药性
        医院环境中的 E. cloacae 菌株耐药率较高,治疗较为棘手

    🦠 Enterobacter hormaechei(霍尔马氏肠杆菌)

    ✅ 特征:

        与 E. cloacae 非常相似,也属于 Enterobacter cloacae complex(阴沟肠杆菌复合群)
        在分子水平上与 E. cloacae 略有不同,通常需要 基因测序(如 16S rRNA 或 MLST) 进行区分

    ✅ 致病性:

        也是一种 机会性病原菌,可引起:
            医院感染(如 ICU 患者的感染)
            免疫力低下患者的败血症
            新生儿败血症(可见于 NICU)

    ✅ 耐药性:

        比 E. cloacae 更容易产生耐药性,特别是 碳青霉烯耐药菌株(CRE)
        近年来,E. hormaechei 被认为是 医院爆发性感染的高危菌株

    🔬 主要区别(E. cloacae vs. E. hormaechei)
    特征  Enterobacter cloacae    Enterobacter hormaechei
    分类  阴沟肠杆菌   霍尔马氏肠杆菌
    复合群 Enterobacter cloacae complex    Enterobacter cloacae complex
    致病性 机会性感染   机会性感染,常见于 ICU
    耐药性 可能产生 ESBLs 或 CRE    更容易产生 CRE,耐药率更高
    分子鉴定    16S rRNA 或 MALDI-TOF    需基因测序区分
    医院爆发    少见  常见
    🩺 预防 & 治疗

        加强医院感染控制(如手卫生、环境消毒)
        抗生素敏感性检测(AST):针对耐药菌使用合适的抗生素,如 替加环素、粘菌素
        限制广谱抗生素的使用,避免耐药菌株传播

    总结:
    🔹 E. cloacae 和 E. hormaechei 都是 Enterobacter cloacae complex 的成员,容易引起医院感染
    🔹 E. hormaechei 通常比 E. cloacae 更耐药,尤其是 CRE 菌株
    🔹 临床上需要分子鉴定 以区分它们,并选择合适的治疗方案

    如果是医院感染菌株,建议做 药敏检测(AST),然后选择合适的抗生素进行治疗 🚑💊

A股指数 (000002)

iShares MSCI China A (Acc) 对应的是 MSCI China A Index(摩根士丹利资本国际中国A股指数)。

该指数专门跟踪中国A股市场,即在中国大陆(上海和深圳证券交易所)上市的人民币计价股票。与其他包含H股、B股等的中国市场指数不同,MSCI China A Index仅包括A股,并且专注于反映中国A股市场的表现。

https://q.stock.sohu.com/zs/000002/index.shtml https://www.finanzen.net/etf/ishares-msci-china-a-etf-ie00bqt3wg13

Größte Positionen ISIN: IE00BQT3WG13 (WKN: A12DPT)

Name    ISIN    Marktkapitalisierung    Gewichtung
Kweichow Moutai Co Ltd Class A  CNE0000018R8    248.650.110.128 €   4,66 %
Contemporary Amperex Technology Co Ltd Class A  CNE100003662    €   3,01 %
China Merchants Bank Co Ltd Class A     CNE000001B33    139.867.582.860 €   2,17 %
China Yangtze Power Co Ltd Class A  CNE000001G87    88.357.974.704 €    1,72 %
BYD Co Ltd Class A  CNE100001526    -   1,72 %
Ping An Insurance (Group) Co. of China Ltd Class A  CNE000001R84    120.318.249.561 €   1,41 %
Wuliangye Yibin Co Ltd Class A  CNE000000VQ8    -   1,28 %
Agricultural Bank of China Ltd Class A  CNE100000RJ0    236.795.874.494 €   1,11 %
Industrial And Commercial Bank Of China Ltd Class A     CNE000001P37    322.932.960.525 €   1,10 %
Industrial Bank Co Ltd Class A  CNE000001QZ7    56.853.246.462 €    1,08 %
Summe Top 10    19,27 %

财经短讯

Hang-Seng-Tech-Index (https://www.justetf.com/de/etf-profile.html?isin=IE00BMWXKN31#uebersicht) https://www.handelsblatt.com/finanzen/anlagestrategie/trends/tech-aktien-nach-dem-nvidia-crash-auf-was-anleger-nun-setzen-02/100111816.html

iShares Core Global Aggregate Bond UCITS ETF https://www.ishares.com/ch/professionals/en/products/320494/ishares-core-global-aggregate-bond-ucits-etf

星期四,06.03.2025 效率冠军 NVIDIA | freenet 的黄金时刻 | Euro-DOGE

更多效率,更少官僚主义——这是埃隆·马斯克在美国通过政府效率部 (DOGE) 努力实现的目标。在巴塞罗那举行的世界移动通信大会 (MWC) 上,德国电信 (Telekom) 首席执行官 Tim Höttges 也大力倡导减少布鲁塞尔官僚机构的繁文缛节。今天,我们将探讨哪些美国科技巨头最注重效率,为什么 freenet 重新发现了电视业务的潜力,以及欧洲如何在“无人机群战”中取得优势。

“我们有媒体监管、网络安全监管、数据保护监管、本地和国际电信监管。我们需要一个减少官僚主义的倡议。” ——德国电信 (Telekom) 首席执行官 Tim Höttges 在全球最大移动通信展 MWC 上

移动通信

freenet 的新机遇

德国移动通信市场由德国电信 (Deutsche Telekom)、英国沃达丰 (Vodafone) 和西班牙电信 (Telefónica) 主导。如果不想被这个寡头垄断市场束缚,消费者可以考虑1&1 这一第四大运营商,该公司自 2023 年起努力增加市场竞争。然而,由于覆盖范围仍然存在漏洞,1&1 用户经常会无意间使用沃达丰更完善的 D2 网络。两家公司为此签订了漫游协议。

在这种情况下,独立移动通信品牌或许是更好的选择。在德国,freenet 是最大的独立移动通信提供商,并且旗下运营着 Klarmobil 等多个品牌。用户不仅可以在 freenet 选择不同的套餐,还可以自由选择德国三大移动通信网络。

2024 年,freenet 的移动通信业务较为稳定。到 12 月底,该公司拥有 770 万 活跃订阅用户,同比增长 2.3%。核心业务收入增长 0.8%,达到 21 亿欧元。然而,总营收增长 3.9%,达到 25 亿欧元,这主要得益于 freenet 旗下迅速扩展的电视业务。

freenet 旗下的媒体部门通过 waipu.tv 内容平台,实现了 240 万 订阅用户,同比增加近 25%。该部门营收增长 15.8%,达到 4 亿欧元。整体 EBITDA(息税折旧摊销前利润)增长 3.5%,达到 5.215 亿欧元。

未来展望方面,受益于 waipu.tv 的增长,freenet 预计即使在减少广告支出的情况下,用户数量仍会大幅上升,盈利也将稳步增长。此外,公司提议将每股股息提高 11% 至 1.97 欧元,并计划回购 1 亿欧元 的股票。过去五年,freenet 的股价累计上涨 近 85%。

本周数据

排行榜:效率决定一切 美国科技巨头人均收入和盈利能力

🔹 2024 年,NVIDIA 在员工效率方面独占鳌头。
🔹 该公司每位员工平均创造 360 万美元 的收入,远超苹果 (Apple) 和 Meta 1.5 倍,几乎是 Alphabet(Google 母公司)的两倍。
🔹 在人均利润方面,NVIDIA 也遥遥领先,每位员工创造 200 万美元 的利润,大幅超越 Apple、Meta、Alphabet 和 Broadcom。

为什么 NVIDIA 如此高效?
✅ 精简团队:相比其他科技巨头,NVIDIA 员工总数最少,业务范围也相对单一。
✅ 专注研发:公司绝大多数员工专注于高性能芯片的研发,而不是管理庞大的多元化业务。
✅ 深耕核心业务:不像 Apple、Meta、Amazon 或 Broadcom 需要庞大人力支持多个业务部门,NVIDIA 几乎全力投入 GPU 及 AI 计算的开发。

这一切似乎验证了一个老生常谈的观点:人才才是企业最宝贵的资源。

今日重要日程

企业财报

📌 Admiral Group - 年度财报
📌 Air France-KLM - 年度财报
📌 Broadcom - Q1 财报
📌 德意志邮政 (DHL) - 年度财报
📌 德国汉莎航空 - 年度财报
📌 Zalando - 年度财报
📌 Merck - 年度财报
📌 Universal Music Group - 年度财报

经济数据

📌 11:00 - 欧元区 1 月零售销售数据
📌 14:15 - 欧洲央行 (ECB) 利率决议

​德国汉莎航空公司在2024年经历了复杂的财务表现。​全年调整后息税前利润(EBIT)下降超过三分之一,至16.5亿欧元,低于2023年的26.8亿欧元,但略高于分析师预期。 ​这一下降主要归因于劳动力和维护成本上升、俄罗斯领空限制以及飞机交付延迟等因素。​
reuters.com
+3
chinaerospace.com
+3
guandian.cn
+3
reuters.com
+3
chinatimes.com
+3
chinaerospace.com
+3

尽管全年利润下滑,汉莎航空在2024年第四季度表现出复苏迹象。​强劲的乘客数量和燃料成本下降使公司业绩超出分析师预期。​这一积极趋势导致股价上涨,表明市场对2025年的乐观预期。 ​公司预计2025年将是“过渡阶段”,并计划通过重组计划在2028年前实现25亿欧元的收益。​此外,尽管面临挑战,航空旅行需求依然强劲,预订趋势在年初表现积极。​
chinaerospace.com
reuters.com
chinatimes.com
+2
cn.investing.com
+2
reuters.com
+2

总体而言,汉莎航空在2024年面临多重挑战,但第四季度的积极表现和对未来的战略规划为2025年的复苏奠定了基础。

阅读推荐

欧洲的“无人机战争”

无人机群战对于任何防空系统都是一项严峻挑战。欧洲防务局 (EDA) 正在尝试将“星球大战”式技术引入现实,以提升欧洲的军事防御能力。点击“heise online”了解更多。

星期三,2025年3月5日 BuBa失败、未来的化石以及Prada对LVMH的回应

中国PC制造商联想推出了一款18英寸可折叠显示屏的笔记本电脑,利用太阳能部分自行产生电力。这是来自中国的创新还是“中国产垃圾”?唐纳德·特朗普显然无法理解这种产品。在美国,电力依然来自插座,石油来自地下。美国总统生活在一个英国石油公司BP也希望重返的世界。

由于特朗普昨天将针对中国的关税加倍至20%,便宜的中国笔记本、电子汽车和智能手机在美国的价格将上涨。在我们的Scalable播客《Asset Class》中,我们的首席经济学家Christian W. Röhl因此采访了一位亚洲专家,探讨亚洲世纪是否已经取消。此外,Röhl还在专栏中深入分析了德国联邦银行的历史损失及其巨大的黄金储备。最后,你将了解Prada是否变得狂妄,以及如何将你的投资组合完整迁移到新的Scalable平台。

播客: Hat China sich verzockt? (中国是不是玩得过头了?)

当所有人都在讨论美国时,在我们的播客《Asset Class – 关于增长与价值》中,我们聚焦于日本、中国、印度和亚洲四小龙:在与作者及亚洲专家Karl Pilny教授的对话中,Scalable的首席经济学家Christian W. Röhl讨论了亚洲的经济与地缘政治发展。

此外,哪些国家对投资者来说潜力最大?中国的百年计划是什么?德国是否可以向日本学习债务管理?以及中国快速的经济增长与股市低迷之间的差异究竟在哪里?

[点击此处观看YouTube视频 – 本期播客也可以在所有播客平台收听。]

Christian W. Röhl 尽管亏损数十亿,德国联邦银行仍保持稳定

尽管自欧元引入以来已有25年,欧洲央行负责货币政策的管理,但德国联邦银行依然被视为国内安全和稳固的象征。上周,当法兰克福的货币监管者公布其财务报表时,大家惊讶不已:2024年其账面亏损达到192亿欧元,这是自1979年以来的首次亏损。难道德国最后的稳定支柱也崩塌了吗?

并非如此。凭借2670亿欧元的储备,德国联邦银行依然拥有巨大的资产。然而,自2022年由Joachim Nagel领导以来,这个机构受到了利率变化的压力:自2015年起,为了减轻国家和经济负担,德国联邦银行购买了大量债券,但这些债券几乎没有收益。而去年,德国联邦银行还需要以平均3.8%的利率支付商业银行存款。

随着零利率、低利率和负利率的货币政策结束,利率环境逐步正常化,债券的估值损失将逐步恢复,新的资金将能以更高的回报率进行投资。然而,直到德国联邦银行能够再次像1980年至2019年那样每年将利润汇入国库,国际货币基金组织预计至少还需要八年时间。在此之前,损失必须被弥补,储备需要恢复,而增值的超过3500吨黄金储备将不会被动用。毕竟,中央银行的任务不是盈利,而是保持货币稳定。

石油公司 回到化石燃料的未来

自2002年起,BP不再仅代表“British Petroleum”,还包括“better people”和“beyond petroleum”。在其形象转型过程中,英国石油公司还更换了新的标志,旨在象征太阳花,传达“石油和天然气是20世纪的事情”这一信息。

然而,现在BP又重新启动了石油泵。原因很简单:化石燃料又变得值得投资。尤其是由于俄罗斯天然气供应中断,欧洲的能源价格在过去几年大幅上涨。美国自特朗普总统上任以来,推行了“钻探,宝贝,钻探!”的政策。

尽管过去五年BP的股价仅上涨了9%,但其竞争对手壳牌的股价却上涨了超过30%。目前,壳牌在股市的市值达到1930亿欧元,是BP的近两倍。而法国的道达尔能源市值为1310亿欧元,约为BP的1.5倍。美国的竞争者雪佛龙和埃克森美孚的市值则更大。

上周三,BP首席执行官Murray Auchincloss宣布将加大对石油和天然气业务的投资,并削减与可再生能源相关的昂贵项目。据《商报》报道,“我认为从长远来看,投资者会喜欢这一转变。”

来自Elliott Investment Management的支持对这一决策也很积极。该美国投资公司长期以来一直要求BP做出转变。同时,Elliott还要求BP出售部分资产,以减轻其巨额债务。Auchincloss表示,到2027年底,BP计划出售约200亿美元的资产,包括价值高达100亿美元的Castrol润滑油业务。

社区问题 迁移到新Scalable平台时如何处理股票碎片?

在2025年第四季度,我们将为您提供一个迁移服务,确保您能顺利将您的证券从我们合作伙伴Baader银行转移到新的Scalable平台。在此迁移过程中,不仅是完整的股份,股票碎片也将被完整转移。与传统的证券转移不同,您无需出售这些碎片。此外,这项服务完全免费。由于证券转移本身不会产生资本利得税,因此您无需担心税务问题。

今天的日程安排

公司

    Abercrombie & Fitch:第四季度财报
    Adidas:年度财报(详细)
    Atos:年度财报
    Bayer:年度财报
    Dassault Aviation:年度财报
    Evonik Industries:年度财报
    Foot Locker:第四季度财报
    Schaeffler:年度财报
    SCOR:年度财报
    Telecom Italia:年度财报(详细)

经济数据

    08:30 瑞士2月通胀率
    11:00 欧元区1月生产者价格
    14:15 美国2月ADP就业变动
    16:00 美国1月工业订单
    22:00 DAX指数成分股调整

阅读推荐 Prada的购物之旅

早在21世纪初,意大利时尚品牌Prada曾尝试成为下一个LVMH,虽然未能成功。为什么此次通过收购Versace的计划可能会再次重启,请阅读《WELT》上的详细报道。

星期二,2025年3月4日 德国ETF,Intel的半数股权和双重奖金

您曾通过电脑或智能手机挖掘过加密货币吗?几年前,这可能还需要一些耐心,而今天则需要高度专业化的矿机。所谓的“专用集成电路”(ASICs)目前是比特币、以太坊等加密货币高效挖掘的技术标准。

这些高科技设备中的芯片来自GlobalFoundries、三星和台积电的工厂。Intel曾经也提供ASICs,但其生产在2023年停止了。如今,遭遇困境的芯片公司正自愿出售自己。Broadcom和台积电这些竞争对手看中了哪些优质资产,您可以在我们这里了解。此外:为什么现在可能还为时过早去掠夺德国经济,以及如何通过ETF投资欧洲“病人”的突然康复。

本周数字 5种加密货币

根据美国总统唐纳德·特朗普的说法,计划将5种加密货币纳入美国的战略储备:比特币、以太坊、XRP、Solana和Cardano。在竞选期间,特朗普曾宣布希望设立战略比特币储备。作为基础,将使用美国政府过去因犯罪活动没收的212,000比特币。预计还会将其他加密货币纳入美国储备,这一消息令市场惊讶并导致币价大幅波动,Cardano的价格上涨最为显著。

芯片制造商 半数股权为半数半数的半导体?

Intel曾经的辉煌如今已经不复存在:在首席执行官Pat Gelsinger于去年12月被解职后,这家美国芯片公司几乎处于没有领导的状态——并且很可能会被竞争对手蚕食。据《华尔街日报》报道,台积电和博通都在考虑独立收购Intel。

美国科技巨头博通被认为对Intel的芯片设计部门感兴趣,双方已经进行了初步的非正式谈判。但在制造部门,博通还需要寻找其他买家。据悉,台积电有可能接管Intel的制造业务,然而这个提议并不是来自Intel,而是来自美国政府,特别是特朗普政府。不过,这样的交易会面临其他挑战。有消息人士透露,特朗普只有在满足一系列条件后才会同意Intel与非美国背景的台积电之间的交易。

对于德国的经济来说,Intel位于马格德堡的厂房是否仍会继续建设,尤为引人关注。据《MDR》报道,虽然计划中的科技园区仍会继续进行,但Intel是否会成为最大的租户,鉴于公司目前的困境,这一可能性已变得更加渺茫。

相比之下,博通的情况则较为轻松:过去五年里,博通的股价上涨了670%以上,公司的市值一度突破了一万亿美元大关。周四晚上,博通将发布第四季度财报。

聚焦ETF 大联盟推动地方联赛

在德国,基督教民主联盟和社民党正在讨论新的特别基金:据报道,这两个党派在联合政府谈判中讨论了9000亿欧元的新债务,其中4000亿欧元用于德国军队,5000亿欧元用于德国基础设施的修复。

这是欧洲病人的治疗方法吗?毕竟,去年德国(与奥地利并列)在欧盟经济增长中排名最后。德国目前已经经历了两个衰退年份,并预计今年的国内生产总值(GDP)增长也将非常微弱。

对于DAX指数来说,九倍的刺激计划对其强劲的股价表现影响最小:德国的蓝筹股指数,依靠出口导向的公司,过去12个月有所上涨——在过去五年中,DAX的得分几乎翻了一倍。

与此不同,德国第二大股指MDAX表现较弱:虽然MDAX也包含了很多出口导向型公司,但德国本土经济在其中扮演了更重要的角色。据估计,MDAX的50家公司大约三分之一的收入来自德国市场。预计中型企业将从军事和基础设施支出的数百亿欧元中受益更多。该指数的表现亟需改进:过去五年,仅上涨了约16%。

德国股市规模最小、但依赖国内市场最多的SDAX指数依赖度更高。估计SDAX超过40%的收入来自德国市场。尽管近年来经济疲软,SDAX在过去五年里仍增长了三分之一以上。

Amundi SDAX (Dist)  
iShares MDAX (Acc)  
Xtrackers DAX (Acc)

第 218 期 | 2025 年 2 月 28 日,星期五: 中国 ETF、慕尼黑再保险公司以及 3150 亿美元的 AI 投资

“诺克斯堡”(Fort Knox) 是位于美国肯塔基州的一个军事基地,众所周知,它是世界上最大的黄金储备库。据称,该地存放了约 4,580 公吨 黄金,总价值约 4,310 亿美元。这笔财富甚至可能超出全球最大再保险公司的承受范围。但这些黄金真的还在那里吗?美国效率管理局 DOGE 的负责人 埃隆·马斯克(Elon Musk)似乎对此心存怀疑,他希望亲自去确认金条是否仍然完好无损。

在此之前,他还向美国政府官员提出了一个棘手的问题。与此同时,我们也在思考,马斯克的老板是否已经失去了对自己关税计划的掌控,以及特朗普(Donald Trump)签署的原材料协议中是否隐藏着什么玄机。

我们的首席经济学家 Christian W. Röhl 阅读了 沃伦·巴菲特(Warren Buffett)的最新股东信,并在最新一期的《Asset Class》播客中与总编辑 Alexander Langer 讨论了一只备受追捧的股票。此外,我们还将分析科技巨头在 AI 数据中心上的惊人投资,并介绍一家受益于这一趋势的公司——每一座新建的数据中心对他们来说都像是一场盛大的局域网(LAN)派对。

MEISTGEHANDELT

买入最多的股票

    Under Armour(安德玛)
    Höegh Autoliners
    Arista Networks
    Mara Holdings
    British American Tobacco(英美烟草)

卖出最多的股票

    Volkswagen(大众)
    Deutsche Lufthansa(德国汉莎航空)
    RWE
    Infineon Technologies(英飞凌科技)
    Alibaba Group(阿里巴巴集团)

(数据基于 2025 年 2 月 21 日至 2 月 27 日 Scalable Broker 平台上 100 只最活跃股票的买卖比例)

科技网络崛起

众所周知,英伟达(NVIDIA) 是整个科技行业的风向标,无论向上还是向下。周四凌晨,这家美国芯片巨头公布了 2024 年第四季度财报,引发了整个科技界的高度关注。虽然 2025 年初科技板块表现低迷,但英伟达的亮眼业绩暂时遏制了市场的进一步下跌。

另一家受益于英伟达行业影响力的公司是 Arista Networks。这家由 Andy Bechtolsheim 创立的美国网络设备制造商,刚刚公布了季度财报,表现超出市场预期。尽管其最大客户 Meta 宣布减少采购 Arista 的产品(此前 Meta 贡献了 21% 的收入,现降至 15%),但 Arista 仍然实现了稳健增长,部分原因是 Oracle 计划加大对其解决方案的采用力度。

Arista 在 2024 年第四季度的每股收益为 0.65 美元,高于市场预期的 0.57 美元。同时,公司营收达到 19.3 亿美元,超出市场预期的 19 亿美元。展望 2025 年,Arista 预计收入增长 17%,高于此前 15% 的预期。

MÄRKTE UND MAKRO

特朗普(Trump)还能保持平衡吗?

在中国,几乎每个人都听过**《三十六计》**,其中第 27 计大致可以翻译为:“装疯卖傻,但不失平衡”。那么,美国总统 特朗普(Donald Trump) 是否也在运用类似策略呢?

本周三,特朗普宣布计划对欧盟进口商品征收 25% 的关税,涵盖汽车及其他产品。这是否意味着他真的要动真格,还是只是在提升谈判筹码?

值得注意的是,特朗普对墨西哥和加拿大的关税计划已经被搁置,原因是两国承诺加强边境管控,以遏制非法毒品贸易。白宫表示,新的关税政策将于下周二正式生效,但特朗普本人则声称实施日期是 4 月 2 日,因为他“有点迷信”,不想选择 4 月 1 日 这个日期。

与此同时,特朗普还与乌克兰总统 泽连斯基(Volodymyr Zelenskyy) 达成协议:乌克兰未来将把能源资源销售收入的 50% 存入一个基金,该基金将用于乌克兰国内投资,但投资项目必须让美国受益。这是否意味着这些投资会优先流向美国企业?答案可能会在本周五正式揭晓。

而欧盟呢? 作为全球最大经济体之一,欧盟目前的影响力仍然受到成员国内部矛盾和德国经济疲软的拖累。德国的消费者信心指数 下降至 -24.7,创下近一年来的新低,尽管实际工资上涨了 3.1%(过去 16 年来的最高增幅)。

CHRISTIAN W. RÖHL

巴菲特的洞察 “奥马哈先知”的一针见血

沃伦·巴菲特(Warren Buffett) 近日发布了 伯克希尔哈撒韦(Berkshire Hathaway) 的年度财报和股东信。虽然他未直接提及 特朗普 或 马斯克,但他强调了稳定货币政策的重要性,并呼吁社会关注“那些因外部因素而陷入困境的人”。

此外,他表示:“美国一直有骗子,他们利用信任骗取他人积蓄。”这句话似乎暗指特朗普的高估值媒体公司及其家族发行的 Meme Coins。

有趣的是,巴菲特这次的美国赞美 似乎有所保留。他表示:“即使没有伯克希尔,美国依然会取得成功。”同时,他强调了亚洲市场的重要性,特别是他在 日本商社 的投资已为他带来了 100 亿美元 的账面收益(增长 70%)。

Podcast

Realty Income:这只股票为何如此受欢迎?

在我们最新一期的播客 《Asset Class – über Wachstum & Werte》(资产类别 – 关于增长与价值) 中,主持人 Christian W. Röhl 深入探讨了这家自称为 “每月派息公司” 的企业为何如此具有吸引力。

与我们的编辑 Alexander Langer 一起,他回溯了 Realty Income 在加利福尼亚的创业历程,并分析了它如何从一家小型餐饮租赁伙伴发展成为全球领先的房地产投资信托基金(REIT)。此外,他们还探讨了这家看似神秘的公司 未来可能面临的机遇与挑战。

📺 点击这里观看 YouTube 视频 – 当然,你也可以在各大播客平台收听这期节目! 🎧

Rückversicherer

Munich Re on fire

2024 hatte es in sich: In den USA verwüsteten die Hurricanes Helene und Milton Teile des Landes, in Mitteleuropa sorgte im September Hochwasser dafür, dass zehntausende Menschen evakuiert werden mussten und in Spanien kostete eine Flutkatastrophe im Oktober hunderte Menschenleben. Klingt nach einem schwierigen Jahr für Versicherungskonzerne wie die Münchener Rück, die Erstversicherer gegen Großschadensereignisse absichern. Doch der Eindruck täuscht.

Der Nettogewinn des weltgrößten Rückversicherers schnellte im abgelaufenen Jahr um 23 % im Vergleich zum Vorjahr auf den Rekordwert von 5,7 Mrd. €. Das lag weit über den von der Münchener Rück angepeilten 5 Mrd. € und schlug auch die Erwartungen der Kapitalmärkte. Der Wachstumsmotor war wieder einmal das Rückversicherungsgeschäft, dessen Gewinn um 26 % auf 4,9 Mrd. € wuchs. Die restlichen 0,8 Mrd. € erzielte die Münchener Rück mit ihrem Tochterunternehmen ERGO.

Obwohl die verheerenden Waldbrände in Los Angeles die Münchener Rück rund 1,2 Mrd. € kosten dürften, stehen die Zeichen auch für das laufende Geschäftsjahr gut. Der Versicherer könnte von der steigenden Anzahl an Naturkatastrophen sogar finanziell profitieren, da höhere Risiken auch teurere Rückversicherungsverträge ermöglichen. Für das Gesamtjahr 2025 peilt man einen Konzerngewinn von 6 Mrd. € an. Eine satte Dividende kommt obendrauf: Die Münchener Rück will die Ausschüttung um ein Drittel auf 20 € pro Aktie erhöhen. Außerdem sollen auch Aktien im Wert von 2 Mrd. € zurückgekauft werden, um den Kurs weiter zu festigen. In den vergangenen fünf Jahren stiegen die Anteile der Münchener Rück bereits um knapp 140 % im Wert.

ETF 聚焦

中国刺激政策

中国 CSI 300 指数 目前仍徘徊在 五年前的水平,资本市场对中国的信心依旧不足,房地产市场的消费疲软 也对经济形成压力。

为提振消费,中国政府推出了一项“家电以旧换新补贴”,涉及微波炉、洗碗机、智能手表等设备,总计补贴金额超过 100 亿欧元。此外,公务员工资 也将大幅提高,以促进国内消费。

推荐投资中国的 ETF:
✅ Amundi MSCI China A II UCITS ETF Dist
✅ Xtrackers CSI 300 Swap UCITS ETF 1C
✅ iShares MSCI China Tech UCITS ETF USD (Acc)

本周图表

“轻资产”时代已结束

(美国科技巨头计划投资支出)
来源:Scalable Capital,企业公告

长期以来,科技行业一直宣扬这样的理念:创新和高速增长 只需一间“传奇车库”,再加上一杯绿茶和一副降噪耳机即可实现。然而,如今的现实却截然不同——美国科技巨头的资本支出(Capex)正迅猛增长。

    Alphabet 计划在 2025 年投资 750 亿美元
    Amazon 宣布支出高达 1000 亿美元
    Microsoft 计划投入 800 亿美元,专注于 AI 数据中心
    Meta(Facebook 母公司) 计划投资 650 亿美元,用于建设“像一座城市一样庞大的 AI 数据中心”

综合来看,这四大科技巨头今年的总投资额预计将超过 3150 亿美元。这是什么概念?这大致相当于德国市值最高的 DAX 企业 SAP 加上制药巨头拜耳(Bayer)的总市值。

但问题来了,这些西海岸科技巨头的投资是否真的能带来回报?毕竟,它们最近刚被中国 AI 新贵 DeepSeek 短暂超越了一次,让许多科技浪漫主义者感叹不已。或许,在中国,“车库创业精神” 依然未曾消失。

Video

In der aktuellen Folge des Video-Podcasts echtgeld.tv geht es um Aktien großer und kleiner deutscher Unternehmen. Außerdem war Julius Weller, unser Vice President Broker, zu Gast, um Scalables neues Private-Equity-Angebot vorzustellen. Hier geht es zum YouTube-Video.

Sie möchten als Erstes vom neuen Angebot profitieren? Tragen Sie sich hier auf der Warteliste ein.

LESETIPP

Exportstopp für Kobalt

Der Preis für den Rohstoff Kobalt fällt auf dem Weltmarkt dramatisch. Jetzt will die Demokratische Republik Kongo als Hauptlieferant den Handel aussetzen, um die Preise zu stabilisieren – mit weitreichenden Auswirkungen auf die Weltwirtschaft. Wer alles beteiligt und betroffen ist, lesen Sie bei „WELT“.

    根据可用数据,以下是德国主要电子交易平台按交易量排序的列表:

    Xetra:由德意志交易所集团运营,是德国最重要的股票交易平台。2022年,Xetra、法兰克福证券交易所和Tradegate Exchange的总交易额达到1.8万亿欧元。
    FINANCEFEEDS.COM

    Tradegate Exchange:专注于私人投资者的证券交易所。2021年12月,Tradegate Exchange的交易额为264亿欧元。
    EUREX.COM

    法兰克福证券交易所(Börse Frankfurt):由德意志交易所集团运营。2021年12月,法兰克福证券交易所的交易额为39亿欧元。
    EUREX.COM

    司徒加特证券交易所(Börse Stuttgart):德国第三大证券交易所,平均占德国股票交易量的约35%。2021年,所有类别的资产交易额为1,070亿欧元。
    ZH.WIKIPEDIA.ORG

    EIX(European Investor Exchange):由BÖAG Börsen AG和Scalable GmbH合作运营,专注于私人投资者。EIX于2024年12月在汉诺威证券交易所推出,提供约15,000种证券的交易。

    gettex:慕尼黑证券交易所的电子交易系统,提供超过27,000种证券的交易。

    Quotrix:杜塞尔多夫证券交易所的电子交易平台,提供超过16,000种证券的交易。

    LS Exchange:由Lang & Schwarz AG与汉堡证券交易所合作运营,提供超过10,000种证券的交易。

    > https://help.scalable.capital/das-neue-scalable-a15d5ee8/kann-ich-weiterhin-entscheiden-an-welcher-b%C3%B6rse-ich-han-66a10758

    Kann ich weiterhin entscheiden, an welcher Börse ich handeln möchte?
    Ja, Sie haben die Wahl, an welcher Börse Sie handeln möchten. Neben der European Investor Exchange (Börse Hannover) stehen Ihnen gettex (Börse München) und Xetra (Deutsche Börse) zur Verfügung. Die Kosten für den Handel über gettex und Xetra haben sich nicht geändert.

    Mit Scalable Crypto handeln Sie Kryptowährungen so einfach wie Aktien oder ETFs: als Crypto-ETPs* auf regulierten Handelsplätzen (European Investor Exchange, gettex, Xetra) in Deutschland. Eine Anlage ist sowohl als Einmalkauf als auch ab 1 € pro Monat im Sparplan möglich. Es wird hierfür kein separates Konto, Wallet oder eine erneute Identifizierung benötigt.

    Die erworbenen Wertpapiere werden sicher in Ihrem Depot verwahrt. Der Gegenwert Ihres Investments in der entsprechenden Kryptowährung wird für Sie durch den Emittenten bei Kryptoverwahrstellen verwahrt. Im Gegensatz zum Handel über Kryptobörsen entfällt für Sie die Notwendigkeit der eigenständigen Verwahrung in einer eigenen Crypto-Wallet.

    > Depot bei Scalable Capital

    Sie können Ihre Order über die European Investor Exchange, gettex (Börse München) oder Xetra (Deutsche Börse) zur Ausführung beauftragen.

    Die Börse Hannover betreibt das elektronische Handelsystem European Investor Exchange gemeinsam mit Scalable Capital. gettex ist ein Börsenplatz der Bayerischen Börse AG. Beim Handel auf der European Investor Exchange und auf gettex werden keine Gebühren seitens der Börse oder der Market Maker erhoben. Hier fallen grundsätzlich weder Maklercourtage noch Börsenentgelt an¹.

    Xetra ist ein Börsenplatz der Deutschen Börse AG. Die Ordergebühr beträgt pauschal 3,99 €, auch im PRIME+ Broker. Hinzu kommt die Handelsplatzgebühr. Diese beträgt lediglich 0,01 % des ausgeführten Volumens (mindestens 1,50 €). Die Handelsplatzgebühr deckt alle Drittkosten für Handel und Abwicklung.

    ¹ Ggf. fallen Produktkosten, Spreads, Zuwendungen und Crypto-Gebühren an. Alle Kosten im Überblick finden Sie in der Kostenübersicht.

    > https://help.scalable.capital/steuern-de12f868/wann-wird-die-vorabpauschale-abgebucht-c6bf9231/?DTStep=01J1SKGBYFH5WGA7H8FHX1TWQ9&

    Wann wird die Vorabpauschale abgebucht (Depot bei Scalable Capital (DEUTDEFFVAC))?

    请注意,以下回复仅适用于居住在德国的客户。

    年初,税务机关会对预提税(Abgeltungsteuer)进行预扣,以确保投资者层面的最低税收。

    如果您在2024年12月31日持有未分配(累积型)或部分分配的投资基金,则需缴纳预提税。如果您当时未持有相关证券,则无需缴税。

    您可以使用justETF的ETF税务计算器计算应缴税款金额。

    您的税负将与您的个人免税额度(例如免税申请或有效的免纳税证明)以及可能的普通亏损抵扣账户进行结算。如果没有相应的免税安排,预提税将按25%的资本利得税率扣除。

    预提税预计将于2025年1月从您的结算账户扣款,请确保账户余额充足。请注意:如果扣款导致账户出现透支,可能会产生透支利息。了解更多信息。

    关于预提税的更多信息,您可以在我们的证券词典中查阅。

https://www.finanzfluss.de/vergleich/depot/neobroker/ https://www.finanzfluss.de/vergleich/depot/

  1. Restnutzungsdauer Gutachten

https://www.immobilienscout24.de/lp/nutzungsdauer-quickcheck.html?utm_source=Iterable&utm_medium=email&utm_campaign=HO_E_SA_landlord_NDG_ndgweekly_ZG3_kw09_2025

Gutachten erhalten
Die nach DIN EN ISO/IEC 17024 zertifizierten
Sachverständigen von Sprengnetter erstellen
dein Gutachten entsprechend sämtlicher
Anforderungen des Finanzamts – auf
Wunsch mit Ortsbesichtigung. Bei
Rückfragen unterstützen wir dich
selbstverständlich kostenlos.

Nutzungsdauer-Gutachten beauftragen
Wie spare ich mit einem
Restnutzungsdauer Gutachten Steuern?
Die meisten Mietimmobilien werden standardisiert über einen
Zeitraum von 50 Jahren ab dem Kaufzeitpunkt mit einem Satz von 2% abgeschrieben. Je nach Baujahr und Ausstattung der Immobilie ist
diese jedoch in Wahrheit deutlich kürzer. Mittels eines Gutachtens
kannst du die kürzere Nutzungsdauer deiner Immobilie beim
Finanzamt nachweisen und so Steuern sparen.

Hinweis: Liegt der Kauf der Immobilie bereits einige Jahre zurück, ist
es möglich, dass der steuerliche Vorteil geringer ist, da bereits ein Teil
des Wertes abgeschrieben ist. Konkrete Auskunft kann ein:e
Steuerberater:in geben.
  1. 2025年2月27日,星期四

慕尼黑再保险利润飓风 | 3150亿美元投资于人工智能等领域

当前的地缘政治形势是否让您感到不安?有时候,一点音乐可以帮助我们调整视角。如果您听一听比利·乔尔的《We Didn’t Start the Fire》,您会发现,世界一直以来都处于动荡之中。在这首歌中,他在副歌之外,仅仅罗列了一系列自1949年他出生至1989年歌曲发布期间发生的重大事件。

如今,这些动荡不仅影响人们的情绪,还直接影响着全球最大的再保险公司的财务状况。虽然去年全球自然灾害频发,导致保险公司面临巨额赔付,但它们依然对未来充满信心。同时,美国的科技巨头们也在加大投资,因为他们坚信,无论当下形势如何,明天依然充满机遇。即使刚果减少了钴出口,也无法阻止科技进步的脚步。 再保险行业

慕尼黑再保险火力全开

2024年是充满挑战的一年:飓风“海伦”和“米尔顿”席卷美国,9月份的洪灾导致数万人在中欧被迫撤离,西班牙在10月份的洪水中损失惨重,数百人遇难。这似乎是再保险公司,如慕尼黑再保险,最不愿看到的情况,毕竟它们负责为保险公司分担重大灾害风险。然而,现实却超出预期。

2024年,全球最大再保险公司的净利润同比增长23%,达到创纪录的57亿欧元,远超此前预计的50亿欧元,也超出了市场的预期。增长的主要动力仍然是再保险业务,其利润增长26%,达到49亿欧元,而子公司ERGO贡献了8亿欧元。

尽管洛杉矶的森林大火可能给慕尼黑再保险带来约12亿欧元的损失,但整体前景依然乐观。随着自然灾害的增多,保险公司的风险敞口扩大,这反而推动了更高的再保险费率,带来了额外的收入。公司预计2025年的净利润将达到60亿欧元,并计划将每股股息提高三分之一,达到20欧元。此外,公司还将回购价值20亿欧元的股票,以进一步稳固股价。在过去五年里,慕尼黑再保险的股价已上涨近140%。
知识小百科
什么是再保险?

再保险是保险公司的保险。像安联(Allianz)、安盛(AXA)或忠利(Generali)这样的保险公司,可以将部分风险转嫁给再保险公司,以降低自身因重大损失事件而陷入财务困境的可能性。作为交换,它们需要向再保险公司支付一定的保费。欧洲知名的再保险公司包括慕尼黑再保险、汉诺威再保险(Hannover Re)和瑞士再保险(Swiss Re)。

当发生保险理赔时,首先是保险公司向客户赔付,随后,保险公司可能会向再保险公司申请部分或全部赔偿。具体补偿方式取决于合同条款:再保险公司可以承担一定比例的理赔金额,或者只在损失超过某个预定阈值时才进行赔付。
本周图表

“轻资产”时代已成过去

美国大型科技公司的资本支出

多年来,科技行业的一个核心理念是,创新和高速增长可以在“车库”里发生——只需要足够的绿茶和降噪耳机。然而,如今情况已经发生了变化。美国科技巨头的资本支出(Capex)正在迅速增长:

    谷歌母公司Alphabet 计划在2025年投资 750亿美元,
    亚马逊 预计资本支出将达到 1000亿美元,
    微软 将投入 800亿美元,主要用于人工智能数据中心,
    Meta(Facebook母公司) 计划投资 650亿美元,以建设一个“堪比一座城市”的人工智能数据中心。

总计,这四家科技巨头将在2025年投入 3150亿美元 用于推动人工智能项目的发展。作为对比,这相当于德国最有价值的DAX公司 SAP 的市值,再加上制药巨头 拜耳(Bayer) 的市值。

这些投资最终是否能带来回报?毕竟,美国科技巨头最近曾短暂地被中国的人工智能初创公司 DeepSeek 超越。对于怀旧的科技爱好者而言,这至少说明,“车库创业精神”在某些地方依然存在。

钴出口禁令

钴这种原材料的价格在全球市场上大幅下跌。作为主要供应国的刚果民主共和国现在计划暂停交易,以稳定价格——这将对全球经济产生深远影响。哪些国家和企业参与其中,谁将受到影响,详情请见《WELT》。
  1. In Deutschland gibt es mehrere bedeutende Handelsplätze, die nach ihrem jährlichen Handelsvolumen geordnet wie folgt aufgelistet werden können:

    • Xetra for Scalable Capital: Der elektronische Handelsplatz der Deutschen Börse dominiert den deutschen Aktienhandel mit einem Anteil von etwa 90 % am gesamten Handelsvolumen aller deutschen Börsen. Im Jahr 2024 wurde auf Xetra ein Orderbuchumsatz von rund 1,27 Billionen (1.270 Milliarden) Euro erzielt. xetra.com

    • Tradegate Exchange: Ein auf Privatanleger fokussierter Handelsplatz mit einem jährlichen Handelsvolumen von etwa 105,8 Milliarden Euro im Jahr 2018. alleaktien.com

    • Börse Stuttgart for Scalable Capital: Bekannt für den Handel mit verbrieften Derivaten und einem Handelsvolumen von rund 90 Milliarden Euro im Jahr 2023. de.wikipedia.org (EIX (European Integrated Exchange) ist tatsächlich mit der Börse Hannover verbunden, wird aber von der Börse Stuttgart betrieben. Es handelt sich um eine Handelsplattform, die speziell auf den außerbörslichen Handel mit strukturierten Produkten wie Derivaten fokussiert ist. Die Börse Hannover selbst ist eher klein und spielt im deutschen Handelsvolumen eine untergeordnete Rolle. EIX wird oft mit EUWAX (ebenfalls von der Börse Stuttgart) verglichen, da beide Plattformen für den Handel mit Zertifikaten, Optionsscheinen und anderen strukturierten Produkten genutzt werden.)

    • Börse Frankfurt: Ein bedeutender Handelsplatz mit einem Orderbuchumsatz von etwa 41,85 Milliarden Euro im Jahr 2024. xetra.com

    • Börse München for Scalable Capital (Gettex): Fokussiert auf Privatanleger, mit einem Handelsvolumen von etwa 11,5 Milliarden Euro im Jahr 2018. alleaktien.com

    • Börse Berlin: Spezialisiert auf ausländische Unternehmen und Fremdwährungsanleihen, mit einem Handelsvolumen von rund 4,6 Milliarden Euro im Jahr 2018. alleaktien.com

    • Börse Hamburg for Trade Republic: Mit einem Handelsvolumen von etwa 3,2 Milliarden Euro im Jahr 2018. alleaktien.com (LANG & SCHWARZ Exchange (LSX) ist eine elektronische Handelsplattform, die unter der Aufsicht der Börse Hamburg steht. Das bedeutet, dass sie zwar eigenständig agiert, aber regulatorisch mit der Börse Hamburg verbunden ist. Die Börse Hamburg selbst ist eine der kleineren deutschen Börsen, aber durch LSX ist sie im außerbörslichen Handel mit Wertpapieren, insbesondere für Privatanleger, relevant.)

    • Börse Düsseldorf: Fokussiert auf Privatanleger und Unternehmensanleihen, mit einem Handelsvolumen von rund 1,9 Milliarden Euro im Jahr 2018. alleaktien.com

2025年2月26日

投资传奇沃伦·巴菲特在高龄之际,是否开始动摇他曾经坚定不移的对美国的信念?我们的首席经济学家Christian W. Röhl仔细研读了巴菲特的年度股东信,并在字里行间发现了对埃隆·马斯克和唐纳德·特朗普的批评。而日本股市却获得了巴菲特的赞誉。值得注意的是,日本化工企业Asahi Kasei并未出现在巴菲特的投资组合中。该公司不仅生产塑料和建筑材料,甚至还建造预制房屋,同时也是全球最大的透析设备制造商之一。而在这一领域的全球市场领导者则来自德国,该公司在过去一年里推行精简战略,并取得了成功。此外,在我们最新一期的播客《Asset Class》中,Christian W. Röhl与我们的主编Alexander Langer探讨了一只备受投资者喜爱的美国股票。

  1. Realty Income:为什么这只股票如此受欢迎?

    在最新一期播客《Asset Class——关于增长与价值》中,主持人Christian W. Röhl深入分析了这家自称为“每月分红公司”(the monthly dividend company)的企业的独特魅力。他与编辑Alexander Langer一起回顾了Realty Income的成长历程,从加州一家小型快餐连锁店的房地产供应商起步,发展成为全球知名的房地产投资信托基金(REIT)。他们还探讨了,这家看似神秘的企业未来可能面临的挑战。

  2. Christian W. Röhl:奥马哈先知的针锋相对

    每年二月的某个星期六,沃伦·巴菲特都会发布他的伯克希尔·哈撒韦(Berkshire Hathaway)年度财报。全球投资者不仅关注公司的业绩数据,更期待巴菲特在《股东信》中的观点、建议和智慧。今年尤其值得关注的,是他对美国新总统颠覆性政策的看法。

    尽管巴菲特没有直接提及唐纳德·特朗普或埃隆·马斯克,但他对稳定货币政策的支持已经表达了一切。同时,他强调“关注那些在生活中不幸落后的人”,这似乎是在批评由马斯克领导的DOGE监管机构的激进改革。此外,巴菲特指出,美国“从来不缺少骗子”,这些人“利用信任他们的投资者”,这一表述可以被解读为对特朗普旗下高估值媒体公司或其家族推出的“迷因币”(Meme Coins)的暗讽。

    值得注意的是,巴菲特此次对美国的赞美不再像以往那样充满信心。几年前,他还坚定地宣称:“永远不要与美国对赌(Don’t bet against America)。” 而这次,他仅表示,伯克希尔的成功故事只能在美国发生,而美国即便没有伯克希尔,也依然会取得成功。同时,他也开始关注亚洲市场。在长达十三页的股东信中,有一整页内容专门讨论了伯克希尔在日本五大商社(包括伊藤忠商事)的投资,这项投资六年来已带来100亿美元的账面收益,增长率达到70%。

    尽管如此,巴菲特并没有计划抛售这些持股,而是希望长期持有,并稳步增加投资。这一策略完全符合他在今年股东信中写下的一句经典投资箴言:

    👉 “错误终将消失,而赢家可以永远绽放”(Mistakes fade away, winners can forever blossom.)

  3. 透析设备行业:节约成本的同时仍保持高额分红

    当肾脏功能衰竭时,如果不加治疗,将会导致死亡。虽然肾脏移植是最佳解决方案,但患者往往需要等待长达十年。人工透析技术诞生于1924年,并在德国吉森大学医院首次成功应用于人类。如今,它依然是帮助患者维持生命的关键技术。透析机通过导管将血液引入机器,模拟肾脏功能,过滤掉毒素和代谢废物。

    全球最大的透析设备上市制造商是德国的费森尤斯医疗(Fresenius Medical Care, FMC)。虽然FMC在过去财年的收入停滞不前,但其财报公布后,股价仍然获得市场正面反馈。原因在于,该公司推行的成本削减计划带来了利润增长。

    FMC 2024年业绩概览:
        营业利润:增长1.7%至14亿欧元
        净利润:增长7.8%至4.99亿欧元
        每股分红:计划提高至1.44欧元(当前**3.2%**股息收益率)

    调整后的财报更加亮眼:

    调整后营业利润:增长18%至18亿欧元
    调整后净利润:增长42%至9.12亿欧元

    FMC预计2025年货币调整后的收入将有低个位数增长,而调整后的营业利润可能增长高两位数至高二十几百分点。

  4. 社区问答:Scalable Broker存款利息是多少?

    Scalable Broker提供的存款利率取决于:

    您的账户类型(FREE Broker或PRIME+ Broker)
    存款金额
    资金存放的具体账户

    当前的存款利率如下(基于欧洲央行存款利率2.75%):

    FREE Broker:仅对最高50,000欧元的存款支付利息
    PRIME+ Broker:对最高500,000欧元的存款支付利息

    如果您的资金仍存放在Baader Bank账户:

    FREE Broker:无利息
    PRIME+ Broker:对最高100,000欧元的存款支付**1.6%**的年利率

    如何将资金转移到利率更高的账户?

    先将资金提现至您的个人银行账户(Girokonto)
    然后使用银行转账、直接扣款或Scalable Instant存入您的Scalable账户
    务必使用新账户的IBAN,以确保获取更高的存款利率

2024年12月

美国对于全球许多人而言是一个令人向往的地方,同时也是全球经济的引擎。在过去几年里,美国市场的回报率几乎无可匹敌。我们将向您展示,如何通过投资美国ETF,以极低的管理成本进入全球最大的经济体。

然而,在未来几十年内,中国可能会成为世界上最大的经济体。在本期新闻通讯中,我们将探讨为何即使如此,也可能是个好主意,在您的投资组合中减少中国市场的比重。不过,您仍然可以保留新兴市场的投资。Xtrackers推出的“剔除中国”ETF(ex-China ETF)正是为此而设计的。

如果您喜欢“躺平”投资,让专业人士帮您管理收益,可以关注iShares推出的主动型ETF。这款基金采用基金中的基金(FOF)结构,全球配置,并包含债券这一额外的资产类别。

在本期ETC专题中,我们破例不谈黄金,而是聊聊白银。

  1. 美国ETF

    从“苹果”中分一杯羹**

    全球市值最大的公司大多在美国证券交易所上市,那里聚集了全球超过一半的市场资本。纽约华尔街的动向直接影响全球股市。最近几个月,受人工智能(AI)带来的科技股行情、美联储(Fed)即将降息的预期以及美国稳健的经济发展推动,美国指数屡创新高。科技股和成长股尤其受到美联储宽松货币政策的预期影响。

    如果想要专注投资全球最大经济体的企业,可以关注Amundi MSCI USA UCITS ETF Acc。该基金追踪MSCI USA指数,涵盖约600只大型和中型企业股票,行业多元化,涉及科技、金融、医药等多个板块。

    Amundi MSCI USA UCITS ETF Acc

    资产类别: 股票
    ISIN代码: IE000FSN19U2
    复制方式: 实物复制
    收益处理: 累积(再投资)
    总费用比率(TER): 0.03%
    主要持仓:
        苹果(Apple):7.0%
        英伟达(Nvidia):6.4%
        微软(Microsoft):5.9%
    数据截至: 2024年12月11日
  2. ETC专题

    白银同样珍贵?**

    白银被称为“普通人的黄金”,也是一种重要的投资资产。在过去五年里,白银价格以美元计上涨超过80%。2024年对于贵金属来说尤其是丰收的一年。

    黄金需求主要来自珠宝行业和各国央行,而白银的主要需求方是工业领域,如计算机、电池和催化剂等。因此,相较于黄金,白银的需求与整体经济发展更加紧密相关。现代投资组合理论建议将贵金属纳入资产配置,以提高投资组合的多样化程度。

    如果想要投资白银,而又不想持有实物,可以选择交易所交易商品(ETC)。这类金融产品的价值与黄金、白银等大宗商品挂钩,并且有实物背书。本质上,它们是可以在交易所买卖的债券,不需要投资者自己存放实物白银。ETC产品通常无固定期限,例如WisdomTree Core Physical Silver,提供了一种便捷、低成本的白银投资方式。

  3. 本季度图表

    全球GDP增长预测(%)**

    未来经济格局的演变

    根据高盛(Goldman Sachs)的研究,预计到2050年,中国将成为全球最大经济体,到2075年,印度也将超越美国成为世界第一。当前全球第三大经济体——德国,50年后可能会跌至第9位,被印度尼西亚、尼日利亚、巴基斯坦、埃及和巴西等新兴经济体超越。

    在许多目前仍然发展中的国家,经济正在快速增长。例如,印尼的GDP预计在未来50年内增长十倍。特别是这些国家相对年轻的人口结构,使得它们在全球经济中的占比不断扩大。

    印度是一个很好的例子。如今,它已经超过中国,成为全球人口最多的国家(超过14亿人)。2007年,印度GDP首次突破1万亿美元,到2024年,预计将增长至3.9万亿美元,17年内几乎翻了四倍。

    根据国际货币基金组织(IMF)的预测,未来几年印度经济仍将保持6.5%的增长率,而中国的增长率预计会继续下降。五年内,中国经济的增速可能仅为印度的一半。而相较于这两个新兴市场,美国的经济增长虽然较低,但依然保持稳定。

  4. 新兴市场ETF

    全球投资,不含中国**

    拉美、亚洲和非洲正在崛起,许多国家的GDP增速超过已开发市场。例如,巴西、印度和南非等国的经济增速远高于欧美国家。投资者可以通过新兴市场ETF,将这些市场纳入投资组合。

    然而,大多数新兴市场ETF中,中国企业的占比非常高。例如,Xtrackers MSCI Emerging Markets ETF的持仓中,超过四分之一是中国公司,印度市场仅占20%左右。这种结构带来的集中风险,在过去五年中并未带来理想的回报。

    相比之下,印度市场表现强劲,印度股票在MSCI India指数中的表现近五年增长了约80%,而中国市场受房地产危机(如恒大破产)和中美贸易战等因素影响,表现不佳。

    如果想规避中国市场的风险,可以选择Xtrackers MSCI Emerging Markets ex China ETF。该基金专注于中国以外的新兴市场,其中印度和台湾市场分别占比27%,韩国市场约占12%。过去五年,该基金的表现比包含中国的版本高出近10%。

    Xtrackers MSCI Emerging Markets ex China UCITS ETF 1C

    资产类别: 股票
    ISIN代码: IE00BM67HJ62
    复制方式: 实物复制
    收益处理: 累积(再投资)
    总费用比率(TER): 0.16%
    主要持仓:
        台积电(TSMC):14.0%
        三星电子:3.2%
        HDFC银行:2.2%
    数据截至: 2024年12月13日
  5. 主动型ETF

    结合两种投资策略的优势**

    ETF通常是被动管理的,主要追踪指数的表现。衡量ETF表现的一个关键指标是“跟踪误差”(Tracking Error),即其与目标指数之间的偏差。

    但近年来,市场上也出现了主动管理型ETF,它们比传统的主动型基金成本更低。例如,iShares Growth Portfolio UCITS ETF的总费用比率(TER)仅0.25%。

    该ETF采用多资产配置策略,主要投资于其他ETF,涵盖全球范围内的数千只股票和债券。同时,该基金要求至少80%的持仓满足ESG(环境、社会和治理)标准。

DNAseq 2025_AYE and DNAseq 2025_adeABadeIJ_adeIJK_CM1_CM2

  1. Targets

    Could you please help me to process these data (Project: X101SC25015922-Z01-J001)?
    For you information,
    1. Please compare the data with the AYE strain (CU459141) across the following conditions:
    a) AYE-S
    b) AYE-Q
    c) AYE-WT on Tig4
    d) AYE-craA on Tig4
    e) AYE-craA-1 on Cm200
    f) AYE-craA-2 on Cm200
    2. The "clinical" sample refers to a clinical isolate of Acinetobacter baumannii. I’m unsure which reference genome would be most appropriate for comparison in this case. Can we use lab strains (CP059040, CU459141, and CP079931) as reference genome for comparison?
    
    Processed the genome sequence for project X101SC24115801-Z01-J001?
    1. Kindly compare the data with the ATCC 19606 strain (CP059040) under the following conditions:
    a) adeABadeIJ (knockout of adeA, adeB, adeI, and adeJ, please confirm whether these genes are successfully knocked out.)
    b) adeIJK (knockout of adeI, adeJ, and adeK, please confirm whether these genes are successfully knocked out.)
    c) CM1
    d) CM2
    The "HF" sample may also refer to a clinical isolate of Enterobacter hormaechei.
    2. The "HF" sample refers to a clinical isolate of Acinetobacter baumannii. I’m unsure which reference genome would be most appropriate for comparison in this case. Can we use lab strains (CP059040, CU459141, and CP079931) as reference genome for comparison?

Project Data_Tam_DNAseq_2025_AYE

  1. Download the raw data.

    86e4016c902a1cd23a2190415425e641  01.RawData/AYE-WTonTig4/AYE-WTonTig4_1.fq.gz
    554eb44ae261312039929f0991582111  01.RawData/AYE-WTonTig4/AYE-WTonTig4_2.fq.gz
    ce004b0d7135bce80f34bd6bac3e89e7  01.RawData/AYE-Q/AYE-Q_1.fq.gz
    bddc7ced051a2167a5a8341332d7423a  01.RawData/AYE-Q/AYE-Q_2.fq.gz
    227d93b8a762185d5dcd1e4975041491  01.RawData/AYE-S/AYE-S_1.fq.gz
    f098c9a8579bf5729427dc871225a290  01.RawData/AYE-S/AYE-S_2.fq.gz
    78e08dd090d89330b1021ce42fb09baa  01.RawData/clinical/clinical_1.fq.gz
    2346fef1d896ef0924d2ec88db51cade  01.RawData/clinical/clinical_2.fq.gz
    4c07494505caf22f70edb54692bcaca2  01.RawData/AYE-craA-1onCm200/AYE-craA-1onCm200_1.fq.gz
    52944e395004dc11758d422690bda168  01.RawData/AYE-craA-1onCm200/AYE-craA-1onCm200_2.fq.gz
    92b498ed7465645ca00bbc945c514fe2  01.RawData/AYE-craAonTig4/AYE-craAonTig4_1.fq.gz
    fd9d670942973e6760d6dd78f4ee852a  01.RawData/AYE-craAonTig4/AYE-craAonTig4_2.fq.gz
    375f1e3efb60571ffd457b3cb1e64a84  01.RawData/AYE-craA-2onCm200/AYE-craA-2onCm200_1.fq.gz
    041c08f4c45f1fabd129fc10500c6582  01.RawData/AYE-craA-2onCm200/AYE-craA-2onCm200_2.fq.gz
    c129aa9a208ca47db10bb04e54c096d7  02.Report_X101SC25015922-Z01-J001.zip
    
    md5sum 01.RawData/AYE-WTonTig4/AYE-WTonTig4_1.fq.gz > MD5.txt_
    md5sum 01.RawData/AYE-WTonTig4/AYE-WTonTig4_2.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-Q/AYE-Q_1.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-Q/AYE-Q_2.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-S/AYE-S_1.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-S/AYE-S_2.fq.gz >> MD5.txt_
    md5sum 01.RawData/clinical/clinical_1.fq.gz >> MD5.txt_
    md5sum 01.RawData/clinical/clinical_2.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-craA-1onCm200/AYE-craA-1onCm200_1.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-craA-1onCm200/AYE-craA-1onCm200_2.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-craAonTig4/AYE-craAonTig4_1.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-craAonTig4/AYE-craAonTig4_2.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-craA-2onCm200/AYE-craA-2onCm200_1.fq.gz >> MD5.txt_
    md5sum 01.RawData/AYE-craA-2onCm200/AYE-craA-2onCm200_2.fq.gz >> MD5.txt_
    md5sum 02.Report_X101SC25015922-Z01-J001.zip >> MD5.txt_
    
    ce004b0d7135bce80f34bd6bac3e89e7  AYE-Q_1.fq.gz
    bddc7ced051a2167a5a8341332d7423a  AYE-Q_2.fq.gz

Data process according to http://xgenes.com/article/article-content/325/analysis-of-snps-indels-transposons-and-is-elements-in-5-a-baumannii-strains/

  1. Call variant calling using snippy

    ln -s ~/Tools/bacto/db/ .;
    ln -s ~/Tools/bacto/envs/ .;
    ln -s ~/Tools/bacto/local/ .;
    cp ~/Tools/bacto/Snakefile .;
    cp ~/Tools/bacto/bacto-0.1.json .;
    cp ~/Tools/bacto/cluster.json .;
    
    mkdir raw_data; cd raw_data;
    
    # Note that the names must be ending with fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-S/AYE-S_1.fq.gz AYE-S_R1.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-S/AYE-S_2.fq.gz AYE-S_R2.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-Q/AYE-Q_1.fq.gz AYE-Q_R1.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-Q/AYE-Q_2.fq.gz AYE-Q_R2.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-WTonTig4/AYE-WTonTig4_1.fq.gz AYE-WT_on_Tig4_R1.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-WTonTig4/AYE-WTonTig4_2.fq.gz AYE-WT_on_Tig4_R2.fastq.gz
    
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-craAonTig4/AYE-craAonTig4_1.fq.gz AYE-craA_on_Tig4_R1.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-craAonTig4/AYE-craAonTig4_2.fq.gz AYE-craA_on_Tig4_R2.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-craA-1onCm200/AYE-craA-1onCm200_1.fq.gz AYE-craA-1_on_Cm200_R1.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-craA-1onCm200/AYE-craA-1onCm200_2.fq.gz AYE-craA-1_on_Cm200_R2.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-craA-2onCm200/AYE-craA-2onCm200_1.fq.gz AYE-craA-2_on_Cm200_R1.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/AYE-craA-2onCm200/AYE-craA-2onCm200_2.fq.gz AYE-craA-2_on_Cm200_R2.fastq.gz
    
    ln -s ../X101SC25015922-Z01-J001/01.RawData/clinical/clinical_1.fq.gz clinical_R1.fastq.gz
    ln -s ../X101SC25015922-Z01-J001/01.RawData/clinical/clinical_2.fq.gz clinical_R2.fastq.gz
    
    #download CU459141.gb from GenBank
    mv ~/Downloads/sequence\(1\).gb db/CU459141.gb
    #setting the following in bacto-0.1.json
    
        "fastqc": false,
        "taxonomic_classifier": false,
        "assembly": true,
        "typing_ariba": false,
        "typing_mlst": true,
        "pangenome": true,
        "variants_calling": true,
        "phylogeny_fasttree": true,
        "phylogeny_raxml": true,
        "recombination": false, (due to gubbins-error set false)
    
        "genus": "Acinetobacter",
        "kingdom": "Bacteria",
        "species": "baumannii",  (in both prokka and mykrobe)
        "reference": "db/CU459141.gb"
    
    mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
    (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
    
    #check if we need big calculation for including the clinical sample by checking mlst. TODO: send the mlst results to Tam. Next step by check vrap which complete isolate?
  2. Download all S epidermidis genomes and identified all ST2 isolates from them!

    #Acinetobacter baumannii Taxonomy ID: 470
    #esearch -db nucleotide -query "txid470[Organism:exp]" | efetch -format fasta -email j.huang@uke.de > genome_470_ncbi.fasta
    #python ~/Scripts/filter_fasta.py genome_470_ncbi.fasta complete_genome_470_ncbi.fasta  #
    
    # ---- Download related genomes from ENA ----
    https://www.ebi.ac.uk/ena/browser/view/470
    #Click "Sequence" and download "Counts" (13059) and "Taxon descendants count" (16091) if there is enough time! Downloading time points is 28.02.2025.
    python ~/Scripts/filter_fasta.py  ena_470_sequence.fasta complete_genome_470_ena_taxon_descendants_count.fasta  #16091-->920
    #python ~/Scripts/filter_fasta.py ena_470_sequence_Counts.fasta complete_genome_470_ena_Counts.fasta  #xxx, 5.8G
  3. Run vrap

    #replace --virus to the specific taxonomy (e.g. Acinetobacter baumannii) --> change virus_user_db --> specific_bacteria_user_db
    mv trimmed/clinical_trimmed*.fastq .
    gzip *.fastq
    ln -s ~/Tools/vrap/ .
    mamba activate /home/jhuang/miniconda3/envs/vrap
    vrap/vrap.py  -1 clinical_trimmed_P_1.fastq.gz -2 clinical_trimmed_P_2.fastq.gz -o vrap_clinical --bt2idx=/home/jhuang/REFs/genome --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/DATA/Data_Tam_DNAseq_2025_AYE/complete_genome_470_ena_taxon_descendants_count.fasta --nt=/mnt/nvme1n1p1/blast/nt --nr=/mnt/nvme1n1p1/blast/nr  -t 100 -l 200  -g
  4. Run second run of bengal3_ac3 without the clinical sample

    mkdir results_with_clinical
    mv variants results_with_clinical
    mv roary results_with_clinical
    mv fasttree results_with_clinical
    mv raxml-ng results_with_clinical
    mv snippy/clinical/ results_with_clinical/clinical_under_snippy
    rm raw_data/clinical_*.fastq.gz
    rm fastq/clinical_*.fastq
    (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds

Variant calling against AYE

  1. Using spandx calling variants (almost the same results to the one from viral-ngs!)

    mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/CU459141
    cp db/CU459141.gb  ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/CU459141/genes.gbk
    vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
    #CU459141.chromosome : CU459141
    /home/jhuang/miniconda3/envs/spandx/bin/snpEff build -c ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config -genbank CU459141 -v   #-d
    #00:00:02 Saving sequences for chromosome 'CU459141.1 ...

    mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3 /home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff build -c ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config -genbank CU459141 -v

    ~/Scripts/genbank2fasta.py CU459141.gb
    mv CU459141.gb_converted.fna CU459141.fasta    #rename "CU459141.1 xxxxx" to "CU459141" in the fasta-file
    
    ln -s /home/jhuang/Tools/spandx/ spandx
    mamba activate /home/jhuang/miniconda3/envs/spandx
    (spandx) nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq" --ref CU459141.fasta --annotation --database CU459141 -resume
    
    #DEBUG "ERROR_CHROMOSOME_NOT_FOUND"
    cd Outputs/Master_vcf
    sed 's/^CU459141/CU459141.1/' out.filtered.vcf > modified_input.vcf
    /home/jhuang/miniconda3/envs/spandx/bin/snpEff  ann -chr CU459141.1 out.filtered.vcf > annotated_variants.vcf
    /home/jhuang/miniconda3/envs/spandx/bin/snpEff databases
  2. Summarize all SNPs and Indels from the snippy result directory.

    #Output: snippy_CP059040/snippy/summary_snps_indels.csv
    # IMPORTANT_ADAPT the array isolates = ["AYE-S", "AYE-Q", "AYE-WT on Tig4", "AYE-craA on Tig4", "AYE-craA-1 on Cm200", "AYE-craA-2 on Cm200"]
    python3 ~/Scripts/summarize_snippy_res.py snippy
    grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv
  3. Additonal steps for DEBUG #ERROR_CHROMOSOME_NOT_FOUND in the generated SPANDx result: Rerun SNP_matrix.sh with /home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff, rather than the snpEff from its own environment spandx!

    # -- The syntax of /home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff is different (see below) --
    #cd ~/DATA/Data_Tam_DNAseq_2025_AYE/snippy/AYE-Q/reference
    ### snpEff build -c reference/snpeff.config -dataDir . -gff3 ref
    #ref.chromosome : CU459141
    ### snpEff ann -noLog -noStats -no-downstream -no-upstream -no-utr -c reference/snpeff.config -dataDir . ref AYE-Q.filt.vcf > AYE-Q.vcf
    
    #(OUTDATED!)  #under Outputs
    #  (bengal3_ac3) cp -r ~/DATA/Data_Tam_DNAseq_2025_AYE/snippy/AYE-Q/reference .
    #  for sample in AYE-craA-1_on_Cm200_trimmed
    #  #(bengal3_ac3) snpEff ann -noLog -noStats -no-downstream -no-upstream -no-utr -c reference/snpeff.config -dataDir . ref out.filtered.vcf > out.filtered_.vcf
    #  for sample in AYE-S_trimmed.PASS.indels AYE-S_trimmed.PASS.snps  AYE-Q_trimmed.PASS.indels AYE-Q_trimmed.PASS.snps  AYE-WT_on_Tig4_trimmed.PASS.indels AYE-WT_on_Tig4_trimmed.PASS.snps  AYE-craA_on_Tig4_trimmed.PASS.indels AYE-craA_on_Tig4_trimmed.PASS.snps  AYE-craA-1_on_Cm200_trimmed.PASS.indels AYE-craA-1_on_Cm200_trimmed.PASS.snps AYE-craA-2_on_Cm200_trimmed.PASS.indels AYE-craA-2_on_Cm200_trimmed.PASS.snps; do
    #          snpEff ann -noLog -noStats -no-downstream -no-upstream -no-utr -c reference/snpeff.config -dataDir . ref Variants/VCFs/${sample}.vcf > Variants/Annotated/${sample}.annotated.vcf
    #  done
    
    #(WORKS!) Under Outputs/Master_vcf and (spandx)
    (spandx) cp -r ../../snippy/AYE-Q/reference .
    (spandx) cp ../../spandx/bin/SNP_matrix.sh ./
    #Note that ${variant_genome_path}=CU459141
    #Adapt "snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v ${variant_genome_path} out.vcf > out.annotated.vcf" to
    "/home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -c reference/snpeff.config -dataDir . ref out.vcf > out.annotated.vcf" in SNP_matrix.sh
    (spandx) bash SNP_matrix.sh CU459141 .
  4. Merge the two files

    #grep -v "/" All_SNPs_indels_annotated.txt > All_SNPs_indels_annotated_.txt
    cut -d$'\t' -f2 All_SNPs_indels_annotated.txt > ../../id1
    cut -d',' -f2 summary_snps_indels_.csv > ../id2
    diff id1 id2
    
    #(OUTDATED!)  Merging summary_snps_indels_.csv (Second half part) and All_SNPs_indels_annotated_.txt (First half part) by pasting the results from two variant calling methods snippy and spandx
    #  cp Outputs/Phylogeny_and_annotation/All_SNPs_indels_annotated_.txt merged_variants.csv
    #  #The common positons are "169174" and "3199991"
    #  grep "CHROM" All_SNPs_indels_annotated_.txt > All_SNPs_indels_annotated__.txt
    #  grep "169174" All_SNPs_indels_annotated_.txt >> All_SNPs_indels_annotated__.txt
    #  grep "3199991" All_SNPs_indels_annotated_.txt >> All_SNPs_indels_annotated__.txt
    #  cut -d$'\t' -f1-11 All_SNPs_indels_annotated__.txt > All_SNPs_indels_annotated___.txt
    #  sed -i '1s/_trimmed//g' All_SNPs_indels_annotated___.txt
    #  cp All_SNPs_indels_annotated___.txt ../../
    #  grep "CHROM" summary_snps_indels_.csv > summary_snps_indels__.csv
    #  grep "169174" summary_snps_indels_.csv >> summary_snps_indels__.csv
    #  grep "3199991" summary_snps_indels_.csv >> summary_snps_indels__.csv
    #  cut -d$',' -f11- summary_snps_indels__.csv > summary_snps_indels___.csv
    #  sed -i 's/,/\t/g' summary_snps_indels___.csv
    #  sed -i 's/ /\t/g' summary_snps_indels___.csv
    #  paste Outputs/Phylogeny_and_annotation/All_SNPs_indels_annotated___.txt snippy/summary_snps_indels___.csv > merged_variants.csv
    
    #The common positons are "169174" and "3199991"
    grep "CHROM" All_SNPs_indels_annotated.txt > All_SNPs_indels_annotated__.txt
    grep "169174" All_SNPs_indels_annotated.txt >> All_SNPs_indels_annotated__.txt
    grep "3199991" All_SNPs_indels_annotated.txt >> All_SNPs_indels_annotated__.txt
    
    #DEBUG
    CU459141        3199991 CTGT    C       INDEL   CTGT    CTGT    CTGT    CTGT    CTGT    C       disruptive_inframe_deletion     MODERATE                gttgct/gct      p.Val8del/c.23_25delTTG 86      ABAYE3157       protein_coding
    -->
    CU459141        3199991 GCTGTT    GCT       INDEL   GCTGTT    GCTGTT    GCTGTT    GCTGTT    GCTGTT    GCT       disruptive_inframe_deletion     MODERATE                gctgtt/gct      p.Val8del/c.23_25delTGT 86      ABAYE3157       protein_coding
    
    gene            3199972..3200232
                    /locus_tag="ABAYE3157"
    CDS             3199972..3200232
                    /locus_tag="ABAYE3157"
                    /inference="ab initio prediction:AMIGene:2.0"
                    /note="Evidence 4 : Homologs of previously reported genes
                    of unknown function"
                    /codon_start=1
                    /transl_table=11
                    /product="conserved hypothetical protein; putative
                    exported protein"
                    /protein_id="CAM87966.1"
                    /db_xref="EnsemblGenomes-Gn:ABAYE3157"
                    /db_xref="EnsemblGenomes-Tr:CAM87966"
                    /db_xref="UniProtKB/TrEMBL:B0V5Q3"
                    /translation="MLFKKFAVAAAVAATLAFVGCSKKEEAPAANAASEATAAASEAT
                    AAASEAATAASQAVDAAASEAAASTAAASAPAASEAAASAAH"
    
    >CU459141:3199972-3200232
    ATGTTATTCAAAAAATTC GCTGTT GCTGCTGCTGTTGCTGCTACTTTAGCTTTCGTTGGTTGTTCTAAGA
    AAGAAGAAGCTCCTGCTGCTAACGCTGCATCTGAAGCTACTGCTGCTGCATCTGAAGCTACTGCTGCTGC
    ATCTGAAGCTGCTACTGCTGCTTCTCAAGCTGTTGATGCTGCTGCATCTGAAGCTGCTGCTTCTACTGCT
    GCCGCTTCTGCACCTGCCGCTTCTGAAGCTGCTGCTTCTGCTGCACACTAA
  5. (Optional–>may error–>ignoring the step! since it should not happed) filter rows without change (between REF and the isolates) in snippy/summary_snpsindels.csv (94)

    awk -F, '
    NR == 1 {print; next}  # Print the header line
    {
        ref = $3;  # Reference base (assuming REF is in the 3rd column)
        same = 1;  # Flag to check if all bases are the same as reference
        samples = "";  # Initialize variable to hold sample data
        for (i = 6; i <= NF - 8; i++) {  # Loop through all sample columns, adjusting for the number of fixed columns
            samples = samples $i " ";  # Collect sample data
            if ($i != ref) {
                same = 0;
            }
        }
        if (!same) {
            print $0;  # Print the entire line if not all bases are the same as the reference
            #print "Samples: " samples;  # Print all sample data for checking
        }
    }
    ' merged_variants.csv > merged_variants_.csv
    #Explanation:
    #    -F, sets the field separator to a comma.
    #    NR == 1 {print; next} prints the header line and skips to the next line.
    #    ref = $3 sets the reference base (assumed to be in the 3rd column).
    #    same = 1 initializes a flag to check if all sample bases are the same as the reference.
    #    samples = ""; initializes a string to collect sample data.
    #    for (i = 6; i <= NF - 8; i++) loops through the sample columns. This assumes the first 5 columns are fixed (CHROM, POS, REF, ALT, TYPE), and the last 8 columns are fixed (Effect, Impact, Functional_Class, Codon_change, Protein_and_nucleotide_change, Amino_Acid_Length, Gene_name, Biotype).
    #    samples = samples $i " "; collects sample data.
    #    if ($i != ref) { same = 0; } checks if any sample base is different from the reference base. If found, it sets same to 0.
    #    if (!same) { print $0; print "Samples: " samples; } prints the entire line and the sample data if not all sample bases are the same as the reference.
  6. (Optional) Draw local genetic environments of △adeIJ

    see http://xgenes.com/article/article-content/325/analysis-of-snps-indels-transposons-and-is-elements-in-5-a-baumannii-strains/
  7. (Optional) Show difference between the genome using BRIG (show no differences by images!)

    This is an issue with java version. I was running Java 1.8 version and had same problem as the OP's question. Then I followed the answers and finally changing the java version worked for me.
    I ran the following steps:
    1) Go to link: https://www.oracle.com/java/technologies/javase-java-archive-javase6-downloads.html
    2) mkdir Java1.6
    3) Download "jdk-6u45-linux-x64.bin" and save in Java1.6
    3) cd Java1.6 && chmod +x jdk-6u45-linux-x64.bin && ./jdk-6u45-linux-x64.bin
    4) You should see java file in Java1.6/jdk1.6.0_45/bin
    5) Java1.6/jdk1.6.0_45/bin/java -version - should give "1.6.0_45"
    6) (base) conda install bioconda::blast-legacy  # install blastall 2.2.26
    7) #set lastLocation="/home/jhuang/miniconda3/bin/" in default-BRIG.xml --> /home/jhuang/miniconda3/bin/blastall is used for blast!
    8) Open BRIG using Java1.6 as follows: ~/Tools/Java1.6/jdk1.6.0_45/bin/java -Xmx15000M -jar BRIG.jar
    Note: I did not uninstall my original Java1.8 rather used the downloded java1.6 execulatable file using path. That way I retain both versions without any problem.
    This gave me all the rings without any issue!! Hope this helps someone!
    Users who wish to run BRIG from the command-line need to:
    * Navigate to the unpacked BRIG folder in a command-line interface (terminal, console, command prompt).
    * Run 'java -Xmx1500M -jar BRIG.jar'. Where -Xmx specifies the amount of memory allocated to BRIG.
    Note:
    * BLAST legacy comes as a compressed package, which will unzip the BLAST binaries where ever the package is. We advise users to first create a BLAST directory (in either the home or applications directory), copy the downloaded BLAST package to that directory and unzip the package.
    * BRIG supports both BLAST+ & BLAST Legacy. If BRIG cannot find BLAST it will prompt users at runtime. Users can specify the location of their BLAST installation in the BRIG options menu which is: Main window > Preferences > BRIG options.
    * N.B: If BOTH BLAST+ and legacy versions are in the same location, BRIG will prefer BLAST+
    convert CP059040.png -crop 4000x2780+3840+0 CP059040_.png

TODO: Variant calling against xxxx (found in vrap results)!

Project Data_Tam_DNAseq_2025_adeABadeIJ_adeIJK_CM1_CM2

    #ln -s ../X101SC24115801-Z01-J001/01.RawData/HF/HF_1.fq.gz HF_R1.fastq.gz
    #ln -s ../X101SC24115801-Z01-J001/01.RawData/HF/HF_2.fq.gz HF_R2.fastq.gz

    (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds
    #HF is Enterobacter cloacae (550) or Enterobacter hormaechei (158836)

    # ---- Download related genomes from ENA ----
    https://www.ebi.ac.uk/ena/browser/view/550
    #Click "Sequence" and download "Counts" (7263) and "Taxon descendants count" (8004) if there is enough time! Downloading time points is 28.02.2025.
    python ~/Scripts/filter_fasta.py  ena_550_sequence.fasta complete_genome_550_ena_taxon_descendants_count.fasta  #8004-->100
    https://www.ebi.ac.uk/ena/browser/view/158836
    #Click "Sequence" and download "Counts" (3763) and "Taxon descendants count" (4846) if there is enough time! Downloading time points is 28.02.2025.
    python ~/Scripts/filter_fasta.py  ena_158836_sequence.fasta complete_genome_158836_ena_taxon_descendants_count.fasta  #4846-->540
    cat complete_genome_158836_ena_taxon_descendants_count.fasta complete_genome_550_ena_taxon_descendants_count.fasta > complete_genome_158836_550.fasta
    #check if the flanking line is correct
    grep "ENA|AP022130|AP022130.1" complete_genome_158836_550.fasta
    #>ENA|AP022130|AP022130.1 Enterobacter cloacae plasmid pWP5-S18-CRE-02_4 DNA, complete genome, strain: WP5-S18-CRE-02.

    mv trimmed/HF_trimmed*.fastq .
    gzip *.fastq
    ln -s ~/Tools/vrap/ .
    mamba activate /home/jhuang/miniconda3/envs/vrap
    vrap/vrap.py  -1 HF_trimmed_P_1.fastq.gz -2 HF_trimmed_P_2.fastq.gz -o vrap_HF --bt2idx=/home/jhuang/REFs/genome --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/DATA/Data_Tam_DNAseq_2025_adeABadeIJ_adeIJK_CM1_CM2/complete_genome_158836_550.fasta --nt=/mnt/nvme1n1p1/blast/nt --nr=/mnt/nvme1n1p1/blast/nr  -t 100 -l 200  -g

    # Variant calling against ATCC 19606
    mkdir results_with_HF
    mv variants results_with_HF
    mv roary results_with_HF
    mv fasttree results_with_HF
    mv raxml-ng results_with_HF
    mv snippy/HF/ results_with_HF/HF_under_snippy
    rm raw_data/HF_*.fastq.gz
    rm fastq/HF_*.fastq
    (bengal3_ac3) /home/jhuang/miniconda3/envs/snakemake_4_3_1/bin/snakemake --printshellcmds

    mkdir ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/CP059040
    cp db/CP059040.gb  ~/miniconda3/envs/spandx/share/snpeff-5.1-2/data/CP059040/genes.gbk
    vim ~/miniconda3/envs/spandx/share/snpeff-5.1-2/snpEff.config
    /home/jhuang/miniconda3/envs/spandx/bin/snpEff build CP059040    #-d
    ~/Scripts/genbank2fasta.py CP059040.gb
    mv CP059040.gb_converted.fna CP059040.fasta    #rename "CP059040.1 xxxxx" to "CP059040" in the fasta-file

    ln -s /home/jhuang/Tools/spandx/ spandx
    mamba activate /home/jhuang/miniconda3/envs/spandx
    (spandx) nextflow run spandx/main.nf --fastq "trimmed/*_P_{1,2}.fastq" --ref CP059040.fasta --annotation --database CP059040 -resume

    python3 ~/Scripts/summarize_snippy_res.py snippy
    grep -v "None,,,,,,None,None" summary_snps_indels.csv > summary_snps_indels_.csv

    cd Outputs/Master_vcf
    (spandx) cp -r ../../snippy/CM1/reference .
    (spandx) cp ../../spandx/bin/SNP_matrix.sh ./
    #Adapt "snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -v ${variant_genome_path} out.vcf > out.annotated.vcf" to
    "/home/jhuang/miniconda3/envs/bengal3_ac3/bin/snpEff eff -no-downstream -no-intergenic -ud 100 -formatEff -c reference/snpeff.config -dataDir . ref out.vcf > out.annotated.vcf" in SNP_matrix.sh
    (spandx) bash SNP_matrix.sh CP059040 .

Supplementary: Enterobacter cloacae (taxid550) vs. Enterobacter hormaechei (taxid158836)

    🔬 介绍
    阴沟肠杆菌(Enterobacter cloacae) 和 霍尔马氏肠杆菌(Enterobacter hormaechei) 都属于 肠杆菌科(Enterobacteriaceae),是革兰氏阴性、兼性厌氧的杆状细菌。它们广泛存在于 环境中(如水、土壤、植物) 以及 人类和动物的肠道 中。

Report

    Please find attached the SNP and Indel calling results for the project DNAseq_2025_AYE against CU459141, as well as for the project DNAseq_2025_adeABadeIJ_adeIJK_CM1_CM2 against CP059040.

    Additionally, I have attached the analysis for the two clinical samples: clinical (in DNAseq_2025_AYE) and HF (in DNAseq_2025_adeABadeIJ_adeIJK_CM1_CM2). You can find the details in the files clinical_summary.xlsx and HF_summary.xlsx. The closest species found in the database are "Acinetobacter baumannii strain 2024CK-00130" and "Enterobacter hormaechei strain K528". Would you like to have the SNP and Indel analysis for the clinical samples compared to these closely related species as the reference?

Submit ChIP-seq raw data to www.ebi.ac.uk/arrayexpress (Project E-MTAB-10475)

https://www.ebi.ac.uk/fg/annotare/login/

  1. General Information

    Title: ChIP-seq of primary human macrophages uninfected or infected with Yersinia enterocolitica strains
    
    Description: Pathogenic bacteria Yersinia enterocolitica injects virulence plasmid-encoded effectors through the type three secretion system into macrophages to modulate gene expression. At this point it is not known whether epigenetic modifications play a role in Yersinia regulation of gene expression. To answer this question primary human macrophages were infected with mock, WAC (virulence plasmid-cured strain) or WA314 (wild type) and samples were subjected to ChIP-seq for H3K4me3, H3K4me1, H3K27ac and H3K27me3. The effect of effector proteins YopM and YopP on histone modifications in macrophages was analyzed using a wild type strain lacking either YopM or YopP and subsequent ChIP-seq analysis.
    
    Experiment Type: 4C, antigen profiling, ATAC-seq, Bisulfite-seq, Capture-C, ChIP-chip by array, ChIP-chip by SNP array, ChIP-chip by tiling array, ChIP-seq*, CLIP-seq, comparative genomic hybridization by array, CUT&RUN, DNA-seq, exome sequencing, FAIRE-seq, genotyping by array, genotyping by high throughput sequencing, GRO-seq, Hi-C, MBD-seq, MeDIP-seq, methylation profiling by array, methylation profiling by high throughput sequencing, microRNA profiling by array, microRNA profiling by high throughput sequencing, MNase-seq, MRE-seq, proteomic profiling by array, Ribo-seq, RIP-chip by array, RIP-seq, RNA-seq of coding RNA, RNA-seq of coding RNA from single cells, RNA-seq of non coding RNA, RNA-seq of non coding RNA from single cells, RNA-seq of total RNA, scATAC-seq, single nucleus RNA sequencing, spatial transcriptomics by high-throughput sequencing, tiling path by array, transcription profiling by array, transcription profiling by RT-PCR
    
    Experimental Designs: cell type comparison design; stimulus or stress design
    2019-01-01
  2. Contacts

    xx
    x.x@uke.de
    Institute of Medical Microbiology, Virology and Hygiene, University Medical Center Hamburg-Eppendorf (UKE)
    Martinistraße 52, 20246 Hamburg, Germany
    
    data analyst
    experiment performer
    submitter
    
    yy
    y.y@uke.de
    investigator
  3. Publications

  4. Create samples, add attributes and experimental variables

    #Header: Name Organism CellType Stimulus MateiralType Immunoprecipitate
    
    Sample 1
    Homo sapiens
    macrophage
    Y. enterocolitica WA314deltaYopM
    DNA
    H3K27ac
    
    Sample 2
    Homo sapiens
    macrophage
    Y. enterocolitica WA314deltaYopM
    DNA
    H3K27ac
    
    Sample 3
    Homo sapiens
    macrophage
    Y. enterocolitica WA314deltaYopP
    DNA
    H3K27ac
    
    Sample 4
    Homo sapiens
    macrophage
    Y. enterocolitica WA314deltaYopP
    DNA
    H3K27ac
    Sample 5
    Homo sapiens
    macrophage
    mock
    DNA
    H3K27ac
    Sample 6
    Homo sapiens
    macrophage
    mock
    DNA
    H3K27ac
    Sample 7
    Homo sapiens
    macrophage
    Y. enterocolitica WAC
    DNA
    H3K27ac
    Sample 8
    Homo sapiens
    macrophage
    Y. enterocolitica WAC
    DNA
    H3K27ac
    Sample 9
    Homo sapiens
    macrophage
    Y. enterocolitica WA314
    DNA
    H3K27ac
    Sample 10
    Homo sapiens
    macrophage
    Y. enterocolitica WA314
    DNA
    H3K27ac
    Sample 11
    Homo sapiens
    macrophage
    mock
    DNA
    H3K27me3
    Sample 12
    Homo sapiens
    macrophage
    mock
    DNA
    H3K27me3
    Sample 13
    Homo sapiens
    macrophage
    Y. enterocolitica WAC
    DNA
    H3K27me3
    Sample 14
    Homo sapiens
    macrophage
    Y. enterocolitica WAC
    DNA
    H3K27me3
    Sample 15
    Homo sapiens
    macrophage
    Y. enterocolitica WAC
    DNA
    H3K27me3
    Sample 16
    Homo sapiens
    macrophage
    Y. enterocolitica WAC
    DNA
    H3K27me3
    Sample 17
    Homo sapiens
    macrophage
    Y. enterocolitica WA314
    DNA
    H3K27me3
    Sample 18
    Homo sapiens
    macrophage
    Y. enterocolitica WA314
    DNA
    H3K27me3
    Sample 19
    Homo sapiens
    macrophage
    Y. enterocolitica WA314
    DNA
    H3K27me3
    Sample 20
    Homo sapiens
    macrophage
    Y. enterocolitica WA314
    DNA
    H3K27me3
    Sample 21
    Homo sapiens
    macrophage
    mock
    DNA
    H3K4me1
    Sample 22
    Homo sapiens
    macrophage
    mock
    DNA
    H3K4me1
    Sample 23
    Homo sapiens
    macrophage
    Y. enterocolitica WAC
    DNA
    H3K4me1
    Sample 24
    Homo sapiens
    macrophage
    Y. enterocolitica WAC
    DNA
    H3K4me1
    Sample 25
    Homo sapiens
    macrophage
    Y. enterocolitica WA314
    DNA
    H3K4me1
    Sample 26
    Homo sapiens
    macrophage
    Y. enterocolitica WA314
    DNA
    H3K4me1
    Sample 27
    Homo sapiens
    macrophage
    mock
    DNA
    H3K4me3
    Sample 28
    Homo sapiens
    macrophage
    mock
    DNA
    H3K4me3
    Sample 29
    Homo sapiens
    macrophage
    Y. enterocolitica WAC
    DNA
    H3K4me3
    Sample 30
    Homo sapiens
    macrophage
    Y. enterocolitica WAC
    DNA
    H3K4me3
    Sample 31
    Homo sapiens
    macrophage
    Y. enterocolitica WAC
    DNA
    H3K4me3
    Sample 32
    Homo sapiens
    macrophage
    Y. enterocolitica WAC
    DNA
    H3K4me3
    Sample 33
    Homo sapiens
    macrophage
    Y. enterocolitica WA314
    DNA
    H3K4me3
    Sample 34
    Homo sapiens
    macrophage
    Y. enterocolitica WA314
    DNA
    H3K4me3
    Sample 35
    Homo sapiens
    macrophage
    Y. enterocolitica WA314
    DNA
    H3K4me3
    Sample 36
    Homo sapiens
    macrophage
    Y. enterocolitica WA314
    DNA
    H3K4me3
    Sample 37
    Homo sapiens
    macrophage
    Y. enterocolitica WA314deltaYopM
    DNA
    H3K4me3
    Sample 38
    Homo sapiens
    macrophage
    Y. enterocolitica WA314deltaYopM
    DNA
    H3K4me3
    Sample 39
    Homo sapiens
    macrophage
    Y. enterocolitica WA314deltaYopP
    DNA
    H3K4me3
    Sample 40
    Homo sapiens
    macrophage
    Y. enterocolitica WA314deltaYopP
    DNA
    H3K4me3
    Sample 41
    Homo sapiens
    macrophage
    mock
    DNA
    H3K4me3
    Sample 42
    Homo sapiens
    macrophage
    mock
    DNA
    H3K4me3
    Sample 43
    Homo sapiens
    macrophage
    Y. enterocolitica WAC
    DNA
    H3K4me3
    Sample 44
    Homo sapiens
    macrophage
    Y. enterocolitica WAC
    DNA
    H3K4me3
    Sample 45
    Homo sapiens
    macrophage
    Y. enterocolitica WA314
    DNA
    H3K4me3
    Sample 46
    Homo sapiens
    macrophage
    Y. enterocolitica WA314
    DNA
    H3K4me3
  5. Assign ENA library information

    #Header: NameResize Library Layout *    Library Source *    Library Strategy *  Library Selection * Library Strand  Nominal Length  Nominal SDev    Orientation
    
    Sample 1
    SINGLE
    GENOMIC (Genomic DNA (includes PCR products from genomic DNA))
    ChIP-Seq (Direct sequencing of chromatin immunoprecipitates)
    ChIP (Chromatin immunoprecipitation)
    
    Sample 2
    SINGLE
    GENOMIC (Genomic DNA (includes PCR products from genomic DNA))
    ChIP-Seq (Direct sequencing of chromatin immunoprecipitates)
    ChIP (Chromatin immunoprecipitation)
  6. Describe protocols

    #Header:    Name    Assign protocols to samples Type    Description *   Performer **    Hardware ** Software
    
    Protocol 1
    Assign...
    Type: sample collection protocol
    Description: Human peripheral blood monocytes were isolated by centrifugation of heparinized blood in Ficoll. Monocytic cells were isolated with magnetic anti-CD14 antibody beads and an MS+ Separation Column (Miltenyi Biotec) according to the manufacturer’s instructions and seeded into 6-well plates at a density of 2 × 106 cells. Cells were cultured in RPMI containing 20% autologous serum at 37°C, 5% CO2, and 90% humidity. The medium was changed every three days until cells were differentiated into macrophages after 7 days. Macrophages were used for infection 1-2 weeks after the isolation
    
    Protocol 2
    Assign...
    nucleic acid extraction protocol
    For the ChIP with formaldehyde crosslinking, macrophages (~3-10 x 106 cells per condition) were washed once with warm PBS and incubated for 30 min at 37 °C with accutase (eBioscience, USA) to detach the cells. ChIP protocol steps were performed as described in (Günther et al., 2016, PMID: 26855283), except that BSA-blocked ChIP grade protein A/G magnetic beads (Thermo Fisher Scientific, USA) were added to the chromatin and antibody mixture and incubated for 2 h at 4 °C rotating to bind chromatin-antibody complexes. Samples were incubated for ~3 min with a magnetic stand to ensure attachment of beads to the magnet and mixed by pipetting during the wash steps.
    
    Protocol 3
    Assign...
    nucleic acid library construction protocol
    ChIP-seq libraries were constructed with 1-10 ng of ChIP DNA or input control as
    a starting material. Libraries were generated using the NEXTflex™ ChIP-Seq Kit
    (Bioo Scientific, USA) as per manufacturer’s recommendations. Concentrations of all
    samples were measured with a Qubit Fluorometer (Thermo Fisher Scientific, USA)
    and fragment length distribution of the final libraries was analysed with the DNA
    High Sensitivity Chip on an Agilent 2100 Bioanalyzer (Agilent Technologies, USA).
    
    Protocol 4
    Assign...
    nucleic acid sequencing protocol
    All samples were normalized to 2 nM and pooled equimolar. The library pool was sequenced on the NextSeq500 (Illumina, USA) with 1 x 75 bp and total at least ~18 million reads per sample.
    Technology Platform Next Generation Sequencing Heinrich Pette Institute, Leibniz Institute for Experimental Virology Martinistraße 52 20251 Hamburg
    NextSeq 500
  7. Assign data files

    #Header: NameResize Raw Data File
    Sample 1
    H3K27ac_dM_DoA.fastq.gz
    Sample 2
    H3K27ac_dM_DoB.fastq.gz
    Sample 3
    H3K27ac_dP_DoA.fastq.gz
    Sample 4
    H3K27ac_dP_DoB.fastq.gz
    Sample 5
    H3K27ac_mock_DoA.fastq.gz
    ...