Author Archives: gene_x

LiftOver: An Essential Utility for the Conversion of Genomic Coordinates

If you have genomic coordinates (like gene positions, SNP positions etc) in hg19 and want to convert them to hg38, you’d use what’s known as a “liftover”. The UCSC Genome Browser provides a tool specifically for this purpose called the UCSC LiftOver tool.

Here is a brief step-by-step process to use the UCSC LiftOver tool:

Format your data: Make sure your data is in BED format (a text file with at least the first three columns being chromosome, start position, and end position).

Go to the LiftOver tool: Visit the UCSC LiftOver tool at https://genome.ucsc.edu/cgi-bin/hgLiftOver.

Upload your file and select the appropriate parameters: Upload your BED file, choose “hg19” as the original genome, and “hg38” as the genome to convert to.

Submit and download the result: Click “submit” and download the resulting BED file, which will contain your original genomic coordinates converted from hg19 to hg38.

It’s important to note that not all genomic regions will have a direct equivalent between versions, so some data may be lost in the conversion.

If you need to do this conversion frequently or with large datasets, there are also command-line tools and scripts available to perform the liftover locally, such as the “CrossMap” tool or the “liftOver” utility provided by UCSC, which can be run from a Unix-like (Linux/Mac) command line.

If you wish to perform a liftover in Python, you can use the package pyliftover. This package provides a Python interface to use the precomputed chain files available from the UCSC Genome Browser. Here is a simple example of how to use this package:

First, you need to install the pyliftover package. You can do this with pip:

pip install pyliftover

Then, download the chain file for the conversion. For hg19 to hg38, you can download it from the UCSC Genome Browser:

wget http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz

Now, you can use the package in Python:

from pyliftover import LiftOver

# Initialize the LiftOver object
lo = LiftOver('hg19', 'hg38')

# Convert a position (e.g., chr1:1000000)
new_pos = lo.convert_coordinate('chr1', 1000000)

print(new_pos)

In this code:

  1. We import the LiftOver class from the pyliftover package.
  2. We initialize a LiftOver object with the source and target genomes (‘hg19’ and ‘hg38’).
  3. We call the convert_coordinate method of the LiftOver object to convert the position. The convert_coordinate method takes two arguments: the name of the chromosome (as a string) and the 0-based position (as an integer).
  4. Finally, we print the converted position.

Please note that the convert_coordinate method returns a list of conversions because a position could potentially map to multiple positions in the new genome. Each conversion is a tuple, where the first element is the name of the new chromosome, the second element is the new position, the third element is the strand (‘+’ or ‘-‘), and the fourth element is the conversion ratio.

Also, keep in mind that not all positions can be converted. If a position cannot be converted, convert_coordinate will return an empty list.

In the process of lifting over, some regions might not get mapped from the old assembly to the new one. There are several reasons why this could happen:

  1. Assembly differences: The two assemblies (source and target) might have different representations of certain genomic regions. For example, some regions might have been rearranged, added, or removed in the newer assembly.

  2. Unmapped regions: Some regions in the older assembly might not have a clear equivalent in the newer assembly, and vice versa. This could happen if, for example, a region is represented differently in the two assemblies due to updates in the understanding of the genome structure.

  3. Quality of mapping: LiftOver uses chain files that represent alignments of large blocks of the genome. If a region in the old assembly does not align well with any region in the new assembly, it might not get mapped.

  4. LiftOver parameters: LiftOver has some parameters (like the minimum ratio of bases that must remap) that can affect which regions get mapped. If you set these parameters to be more stringent, some regions might not get mapped.

When using LiftOver, it’s important to check the unmapped regions (the tool usually provides a file with these regions) to see if they are important for your analysis. If they are, you might need to consider other strategies to include them, such as manual inspection or alternative mapping tools.

#Deleted in new
chr14   483848  483848

#Explain failure messages
Deleted in new:
    Sequence intersects no chains
Partially deleted in new:
    Sequence insufficiently intersects one chain
Split in new:
    Sequence insufficiently intersects multiple chains
Duplicated in new:
    Sequence sufficiently intersects multiple chains
Boundary problem:
    Missing start or end base in an exon

https://epd.expasy.org/epd/human/human_database.php?db=human

Identifying the Nearest Genomic Peaks within Defined Regions

To find the closest peaks in the genome regions defined by a bed file, you can use a tool like BEDTools. BEDTools provides a function closest which allows you to find the closest feature in a second file to each feature in a first file.

Here is an example of how you might use BEDTools to achieve this:

bedtools closest -a peaks.bed -b genome_regions.bed > closest_peaks.bed

In this command:

  • -a peaks.bed specifies the input file with the peaks.
  • -b genome_regions.bed specifies the input file with the genome regions.
  • > closest_peaks.bed redirects the output to a file named closest_peaks.bed. The resulting closest_peaks.bed file will contain the closest peak from peaks.bed for each region in genome_regions.bed.

Please note that you’ll need to have BEDTools installed to use this command.

If you’re looking to do this operation in Python, you might want to look into libraries like pybedtools which provide a Python interface to BEDTools. Using pybedtools, the command would look something like this:

import pybedtools

peaks = pybedtools.BedTool('peaks.bed')
genome_regions = pybedtools.BedTool('genome_regions.bed')

closest_peaks = peaks.closest(genome_regions)

closest_peaks.saveas('closest_peaks.bed')

This script does the same thing as the previous bash command but in Python.

Remember that BEDTools requires your BED files to be sorted, and the bedtools closest command requires both input files to be sorted in the same order. So if you sort your peaks file, you should also sort your genome regions file in the same way, and vice versa.

  1. Sort your BED file: You can use the sort-bed command from the BEDOPS suite, or sort -k1,1 -k2,2n in Unix to sort your BED file. Here is an example:

    sort -k1,1 -k2,2n genome_regions.bed > genome_regions_sorted.bed

    Then you can rerun your bedtools closest command with the sorted BED file:

    bedtools closest -a peaks_NHDF_.bed -b genome_regions_sorted.bed
  2. Specify a genome file: If you have a genome file (a text file listing all of the chromosome names and sizes in your genome), you can use the -g option to tell BEDTools the correct order of the chromosomes. Here is an example:

    bedtools closest -a peaks_NHDF_.bed -b genome_regions.bed -g genome_file.txt

    In this example, genome_file.txt should be a two-column, tab-separated file with chromosome names in the first column and chromosome sizes in the second column. The chromosomes should be listed in the desired order.

Commands

sort -k1,1 -k2,2n peaks_NHDF_.bed > peaks_NHDF_sorted.bed
sort -k1,1 -k2,2n Integrationsites_converted_.bed > Integrationsites_converted_sorted.bed
bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_sorted.bed -d > peaks_on_integrationsites.bed
bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_sorted.bed -d | awk 'BEGIN {OFS="\t"} {if ($2 < $7) print $0, $10 * -1; else print $0, $10}' > peaks_on_integrationsites.bed
#sort -k9,9 -k10,10n peaks_on_integrationsites.bed | awk '!seen[$9]++' > peaks_on_integrationsites_unique.bed
sort -k9,9 -k10,10n peaks_on_integrationsites.bed > peaks_on_integrationsites_sorted.bed

bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_1.bed -d | awk 'BEGIN {OFS="\t"} {if ($2 < $7) print $0, $10 * -1; else print $0, $10}' > peaks_on_integrationsites_1.bed
#sort -k9,9 -k10,10n peaks_on_integrationsites_1.bed | awk '!seen[$9]++' > peaks_on_integrationsites_unique_1.bed
sort -k9,9 -k10,10n peaks_on_integrationsites_1.bed | awk '($10 != -1) && !seen[$9]++' > peaks_on_integrationsites_unique_1.bed
#chr6    16523081        16524081        Merged-chr6-16523581-1  7.680000        chr6    20569311        20635162        WaGa_Czech-Sioli_et_al._Plos_Pathog_2020        4045231 -4045231
#cut -f10 peaks_on_integrationsites_1.bed | sort -n | uniq > doublecheck.txt
#-1
#4045231
#4085125
#4223056

> peaks_on_integrationsites_unique.bed
#for number in 1 2 3 4 5  6 7 8 9 10  11 12 13 14 15  16 17 18 19 20  21 22 23 24 25  26 27 28 29 30  31 32 33 34 35  36 37 38 39 40  41 42 43 44 45  46 47 48 49 50; do
for number in 51 52 53 54 55  56 57 58 59 60  61 62 63 64 65  66 67 68 69 70  71 72 73; do
  sed -n "${number}p" Integrationsites_converted_.bed > Integrationsites_converted_${number}.bed
  bedtools closest -a peaks_NHDF_sorted.bed -b Integrationsites_converted_${number}.bed -d | awk 'BEGIN {OFS="\t"} {if ($2 < $7) print $0, $10 * -1; else print $0, $10}' > peaks_on_integrationsites_${number}.bed
  sort -k9,9 -k10,10n peaks_on_integrationsites_${number}.bed | awk '($10 != -1) && !seen[$9]++' >> peaks_on_integrationsites_unique.bed
done

#check the bed file format using the following commands.
cat -t Integrationsites_converted_4.bed
sed 's/ \+/\t/g' Integrationsites_converted_4.bed > Integrationsites_converted_tab.bed
cut -f2,3 Integrationsites_converted_4.bed | grep -Pv '^\d+\t\d+$'

~/Tools/csv2xls-0.4/csv_to_xls.py peaks_on_integrationsites_unique_.bed -d$'\t' -o peaks_on_integrationsites.xls

Antibodies and cell lines that are commonly used in research

http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeBroadHistone

Antibody:

  • CBP (SC-369)
  • H3K4me1
  • H3K4me2
  • H3K4me3
  • H3K9ac
  • H3K9me1
  • H3K9me3
  • H3K27ac
  • H3K27me3
  • H3K36me3
  • H3K79me2
  • H4K20me1

Cell Line:

  • GM12878 (Tier 1)
  • H1-hESC (Tier 1)
  • K562 (Tier 1)
  • A549 (Tier 2)
  • HeLa-S3 (Tier 2)
  • HepG2 (Tier 2)
  • HUVEC (Tier 2)
  • Monocytes CD14+ RO01746
  • Dnd41
  • HSMMtube
  • NH-A appears to refer to Normal Human Astrocytes. Astrocytes are a type of cell in the brain and spinal cord (i.e., the central nervous system). They provide physical and nutritional support for neurons and modulate their functions. Astrocytes play a key role in a variety of functions including maintaining the blood-brain barrier, providing nutrients to nervous tissue, and playing a role in the repair and scarring process of the brain and spinal cord following traumatic injuries. They are also involved in many neurological conditions, making them an important focus for research.
  • NHDF-Ad: Normal Human Dermal Fibroblasts (Adult). These cells are found in the dermis layer of the skin and are responsible for generating connective tissue and allowing the skin to recover from injury.
  • NHEK: Normal Human Epidermal Keratinocytes. These cells are from the outermost layer of the skin and are involved in protecting the body from environmental damage.
  • NHLF: Normal Human Lung Fibroblasts. Fibroblasts are the most common cells of connective tissue in the body and play a crucial role in wound healing.
  • HSMM: Human Skeletal Muscle Myoblasts. These cells are precursors to muscle cells (myocytes) and play a crucial role in muscle repair and regeneration.
  • HMEC: Human Mammary Epithelial Cells. These cells line the ducts and lobules of the mammary gland.
  • Osteoblasts: These are the cells that are responsible for bone formation.

Defining and Categorizing Promoter Types Based on the ‘GRGGC’ Motif Frequency and Distance to TSS

  1. generate_promter_sequences

    #!/usr/bin/env python3
    #./1_generate_promoter_sequences.py gencode.v43.annotation.gtf.db
    import gffutils
    from pyfaidx import Fasta
    import argparse
    from Bio import SeqIO
    from Bio.SeqRecord import SeqRecord
    from Bio.Seq import Seq
    
    def extract_promoter_sequences(gencode_db, genome_records, upstream_length=3000, downstream_length=3000):
        db = gffutils.FeatureDB(gencode_db, keep_order=True)
    
        promoters = []
        for gene in db.features_of_type('gene'): #CDS
            if gene.strand == '+':
                promoter_start = max(gene.start - upstream_length, 1)
                promoter_end = gene.start + downstream_length - 1
            else: # gene.strand == '-'
                promoter_start = gene.end - downstream_length + 1
                promoter_end = min(gene.end + upstream_length, len(genome_records[gene.chrom]))
    
            promoter_seq = str(genome_records[gene.chrom][promoter_start-1:promoter_end])
    
            # commenting the following two lines don't effect the output size, both are 62703 records 
            #if not set(promoter_seq.upper()) <= {"A", "C", "G", "T"}:
            #    print(f"Invalid sequence for id {gene.id}: {promoter_seq}")
    
            if gene.strand == '-':
                promoter_seq = str(Seq(promoter_seq).reverse_complement())
    
            record = SeqRecord(Seq(promoter_seq.upper()), id=gene.id, description="")
            promoters.append(record)
        return promoters
    
    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description='Plot peak distribution around TSS.')
        parser.add_argument('gencode_db', type=str, help='Path to the gencode.v43.annotation.gtf.db file')
        args = parser.parse_args()
    
        genome_records = Fasta('/ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa')
    
        promoter_sequences = extract_promoter_sequences(args.gencode_db, genome_records)
        SeqIO.write(promoter_sequences, "promoter_sequences_3000_3000.fasta", "fasta")
        #62703 containing ENSG00000198804.2
  2. find motifs in promoters

    #!/usr/bin/env python3
    #./find_motifs_in_promoters.py
    import re
    from Bio import SeqIO
    
    def find_motif_frequency_and_distribution(fasta_file, output_file, motif_1="GRGGC", motif_2="GCCYR"):
        motif_1 = motif_1.replace("R", "[AG]").replace("Y", "[CT]")
        motif_2 = motif_2.replace("R", "[AG]").replace("Y", "[CT]")
    
        promoter_sequences = SeqIO.parse(fasta_file, "fasta")
        #with open(output_file, 'w') as f:
        #    f.write("PromoterID\tMotif1_Count\tMotif2_Count\tMotif1_Positions\tMotif2_Positions\n")
        #    for promoter in promoter_sequences:
        #        motif1_positions = [m.start() for m in re.finditer(motif_1, str(promoter.seq))]
        #        motif2_positions = [m.start() for m in re.finditer(motif_2, str(promoter.seq))]
        #        line = f"{promoter.id}\t{len(motif1_positions)}\t{len(motif2_positions)}\t{motif1_positions}\t{motif2_positions}\n"
        #        f.write(line)
        with open(output_file, 'w') as f:
            f.write("PromoterID\tMotif1_Count\tMotif2_Count\tMotif1_Center_Positions\tMotif2_Center_Positions\n")
            for promoter in promoter_sequences:
                motif1_positions = [(m.start() + len(motif_1) // 2 - 3000) for m in re.finditer(motif_1, str(promoter.seq))]
                motif2_positions = [(m.start() + len(motif_2) // 2 - 3000) for m in re.finditer(motif_2, str(promoter.seq))]
                line = f"{promoter.id}\t{len(motif1_positions)}\t{len(motif2_positions)}\t{motif1_positions}\t{motif2_positions}\n"
                f.write(line)
    
    if __name__ == "__main__":
        find_motif_frequency_and_distribution("promoter_sequences_3000_3000.fasta", "motif_counts_positions.txt")
  3. R codes

    #--3.1. draw plots for all motifs
    #DELETE in Kate '[' and ']' in motif_counts_positions.txt
    data <- read.table("motif_counts_positions.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE)
    
    data <- rename(data, Motif_GRGGC_Count=Motif1_Count, Motif_GCCYR_Count=Motif2_Count, Motif_GRGGC_Center_Positions = Motif1_Center_Positions, Motif_GCCYR_Center_Positions = Motif2_Center_Positions)
    
    data$Motif_GRGGC_Count <- as.numeric(data$Motif_GRGGC_Count)
    data$Motif_GCCYR_Count <- as.numeric(data$Motif_GCCYR_Count)
    
    # Histogram of Motif1 counts
    png("Histogram_of_Motif_GRGGC_counts.png", width=800, height=800)
    hist(data$Motif_GRGGC_Count, main="Motif GRGGC Counts", xlab="Counts")
    dev.off()
    
    # Histogram of Motif2 counts
    png("Histogram_of_Motif_GCCYR_counts.png", width=800, height=800)
    hist(data$Motif_GCCYR_Count, main="Motif GCCYR Counts", xlab="Counts")
    dev.off()
    
    # # Convert list-like string to actual list of numbers
    # data$Motif_GRGGC_Center_Positions <- sapply(strsplit(gsub("[\\[\\]\\s+]", "", data$Motif_GRGGC_Center_Positions), ","), function(x) as.numeric(x[nzchar(x)]))
    # data$Motif_GCCYR_Center_Positions <- sapply(strsplit(gsub("[\\[\\]\\s+]", "", data$Motif_GCCYR_Center_Positions), ","), function(x) as.numeric(x[nzchar(x)]))
    data$Motif_GRGGC_Center_Positions <- strsplit(as.character(data$Motif_GRGGC_Center_Positions), ",")
    data$Motif_GCCYR_Center_Positions <- strsplit(as.character(data$Motif_GCCYR_Center_Positions), ",")
    
    # Create data frames for each motif
    data_motif_GRGGC <- data.frame(PromoterID = rep(data$PromoterID, data$Motif_GRGGC_Count), 
                              Motif = "Motif GRGGC", 
                              Position = unlist(data$Motif_GRGGC_Center_Positions))
    
    data_motif_GCCYR <- data.frame(PromoterID = rep(data$PromoterID, data$Motif_GCCYR_Count), 
                              Motif = "Motif GCCYR", 
                              Position = unlist(data$Motif_GCCYR_Center_Positions))
    
    # Combine motif data frames
    data_positions <- rbind(data_motif_GRGGC, data_motif_GCCYR)
    
    # Get positions as numeric values, excluding NAs
    motif1_positions <- as.numeric(data_positions$Position[data_positions$Motif == "Motif GRGGC"])
    # Plot density of Motif1 positions
    png("Density_of_Motif_GRGGC_positions.png", width=800, height=800)
    plot(density(motif1_positions, na.rm = TRUE), main="Density Plot of Motif GRGGC Positions", xlab="Position")
    dev.off()
    
    # Get positions as numeric values, excluding NAs
    motif2_positions <- as.numeric(data_positions$Position[data_positions$Motif == "Motif GCCYR"])
    # Plot density of Motif2 positions
    png("Density_of_Motif_GCCYR_positions.png", width=800, height=800)
    plot(density(motif2_positions, na.rm = TRUE), main="Density Plot of Motif GCCYR Positions", xlab="Position")
    dev.off()
    
    #--3.2. split data into two dataframe, one is in which PromoterID occurres in filtered_data$Closest.Gene..Ensemble.ID., another is the remaining
    # Create a new data frame for rows in 'data' where the PromoterID is in filtered_data's Closest Gene (Ensemble ID)
    data_in_filtered <- data[data$PromoterID %in% filtered_data$Closest.Gene..Ensemble.ID., ]
    data_in_filtered$Motif_GRGGC_Center_Positions <- sapply(data_in_filtered$Motif_GRGGC_Center_Positions, toString)
    data_in_filtered$Motif_GCCYR_Center_Positions <- sapply(data_in_filtered$Motif_GCCYR_Center_Positions, toString)
    write.table(data_in_filtered, file="motifs_in_promoters_with_LT_peak.txt", sep="\t", row.names = FALSE)
    
    # Create another data frame for the remaining rows
    data_not_in_filtered <- data[!data$PromoterID %in% filtered_data$Closest.Gene..Ensemble.ID., ]
    data_not_in_filtered$Motif_GRGGC_Center_Positions <- sapply(data_not_in_filtered$Motif_GRGGC_Center_Positions, toString)
    data_not_in_filtered$Motif_GCCYR_Center_Positions <- sapply(data_not_in_filtered$Motif_GCCYR_Center_Positions, toString)
    write.table(data_not_in_filtered, file="motifs_in_promoters_without_LT_peak.txt", sep="\t", row.names = FALSE)
    
    #--3.3. draw plots for motifs_in_promoters_with_LT_peak.txt
    #DELETE with Kate '"' in motifs_in_promoters_with_LT_peak.txt
    data <- read.table("motifs_in_promoters_with_LT_peak.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE)
    
    data$Motif_GRGGC_Count <- as.numeric(data$Motif_GRGGC_Count)
    data$Motif_GCCYR_Count <- as.numeric(data$Motif_GCCYR_Count)
    
    # Histogram of Motif_GRGGC counts
    png("Histogram_of_Motif_GRGGC_in_promoters_with_LT_peak.png", width=800, height=800)
    hist(data$Motif_GRGGC_Count, main="Motif GRGGC Counts", xlab="Counts")
    dev.off()
    
    # Histogram of Motif_GCCYR counts
    png("Histogram_of_Motif_GCCYR_in_promoters_with_LT_peak.png", width=800, height=800)
    hist(data$Motif_GCCYR_Count, main="Motif GCCYR Counts", xlab="Counts")
    dev.off()
    
    data$Motif_GRGGC_Center_Positions <- strsplit(as.character(data$Motif_GRGGC_Center_Positions), ",")
    data$Motif_GCCYR_Center_Positions <- strsplit(as.character(data$Motif_GCCYR_Center_Positions), ",")
    
    # Create data frames for each motif
    data_motif1 <- data.frame(PromoterID = rep(data$PromoterID, data$Motif_GRGGC_Count), 
                              Motif = "Motif GRGGC", 
                              Position = unlist(data$Motif_GRGGC_Center_Positions))
    
    data_motif2 <- data.frame(PromoterID = rep(data$PromoterID, data$Motif_GCCYR_Count), 
                              Motif = "Motif GCCYR", 
                              Position = unlist(data$Motif_GCCYR_Center_Positions))
    
    # Combine motif data frames
    data_positions <- rbind(data_motif1, data_motif2)
    
    # Get positions as numeric values, excluding NAs
    motif1_positions <- as.numeric(data_positions$Position[data_positions$Motif == "Motif GRGGC"])
    # Plot density of Motif_GRGGC positions
    png("Density_of_Motif_GRGGC_in_promoters_with_LT_peak.png", width=800, height=800)
    plot(density(motif1_positions, na.rm = TRUE), main="Density Plot of Motif GRGGC Positions", xlab="Position")
    dev.off()
    
    # Get positions as numeric values, excluding NAs
    motif2_positions <- as.numeric(data_positions$Position[data_positions$Motif == "Motif GCCYR"])
    # Plot density of Motif_GCCYR positions
    png("Density_of_Motif_GCCYR_in_promoters_with_LT_peak.png", width=800, height=800)
    plot(density(motif2_positions, na.rm = TRUE), main="Density Plot of Motif GCCYR Positions", xlab="Position")
    dev.off()
    
    #--3.4. draw plots for motifs_in_promoters_without_LT_peak.txt
    #DELETE '"' with Kate in motifs_in_promoters_without_LT_peak.txt
    data <- read.table("motifs_in_promoters_without_LT_peak.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE)
    
    data$Motif_GRGGC_Count <- as.numeric(data$Motif_GRGGC_Count)
    data$Motif_GCCYR_Count <- as.numeric(data$Motif_GCCYR_Count)
    
    # Histogram of Motif_GRGGC counts
    png("Histogram_of_Motif_GRGGC_in_promoters_without_LT_peak.png", width=800, height=800)
    hist(data$Motif_GRGGC_Count, main="Motif GRGGC Counts", xlab="Counts")
    dev.off()
    
    # Histogram of Motif_GCCYR counts
    png("Histogram_of_Motif_GCCYR_in_promoters_without_LT_peak.png", width=800, height=800)
    hist(data$Motif_GCCYR_Count, main="Motif GCCYR Counts", xlab="Counts")
    dev.off()
    
    data$Motif_GRGGC_Center_Positions <- strsplit(as.character(data$Motif_GRGGC_Center_Positions), ",")
    data$Motif_GCCYR_Center_Positions <- strsplit(as.character(data$Motif_GCCYR_Center_Positions), ",")
    
    # Create data frames for each motif
    data_motif1 <- data.frame(PromoterID = rep(data$PromoterID, data$Motif_GRGGC_Count), 
                              Motif = "Motif GRGGC", 
                              Position = unlist(data$Motif_GRGGC_Center_Positions))
    
    data_motif2 <- data.frame(PromoterID = rep(data$PromoterID, data$Motif_GCCYR_Count), 
                              Motif = "Motif GCCYR", 
                              Position = unlist(data$Motif_GCCYR_Center_Positions))
    
    # Combine motif data frames
    data_positions <- rbind(data_motif1, data_motif2)
    
    # Get positions as numeric values, excluding NAs
    motif1_positions <- as.numeric(data_positions$Position[data_positions$Motif == "Motif GRGGC"])
    # Plot density of Motif_GRGGC positions
    png("Density_of_Motif_GRGGC_in_promoters_without_LT_peak.png", width=800, height=800)
    plot(density(motif1_positions, na.rm = TRUE), main="Density Plot of Motif GRGGC Positions", xlab="Position")
    dev.off()
    
    # Get positions as numeric values, excluding NAs
    motif2_positions <- as.numeric(data_positions$Position[data_positions$Motif == "Motif GCCYR"])
    # Plot density of Motif_GCCYR positions
    png("Density_of_Motif_GCCYR_in_promoters_without_LT_peak.png", width=800, height=800)
    plot(density(motif2_positions, na.rm = TRUE), main="Density Plot of Motif GCCYR Positions", xlab="Position")
    dev.off()
    
    #--3.5. assign each promoter to a prefined class according to Total_Motifs and Avg_Motif_Distance_To_TSS
    data <- read.table("motifs_in_promoters_with_LT_peak.txt", sep="\t", header=TRUE, stringsAsFactors=FALSE)
    data$Motif_GRGGC_Count <- as.numeric(data$Motif_GRGGC_Count)
    data$Motif_GCCYR_Count <- as.numeric(data$Motif_GCCYR_Count)
    
    # Add a column for the sum of Motif_GRGGC_Count and Motif_GCCYR_Count
    data <- data %>%
      mutate(Total_Motifs = Motif_GRGGC_Count + Motif_GCCYR_Count)
    
    # Calculate absolute average Motif_GRGGC distance to TSS
    data$Avg_Motif_GRGGC_Distance_To_TSS <- sapply(strsplit(data$Motif_GRGGC_Center_Positions, ", "), 
                                             function(x) round(mean(as.numeric(x), na.rm = TRUE)))
    # Calculate absolute average Motif_GCCYR distance to TSS
    data$Avg_Motif_GCCYR_Distance_To_TSS <- sapply(strsplit(data$Motif_GCCYR_Center_Positions, ", "), 
                                             function(x) round(mean(as.numeric(x), na.rm = TRUE)))
    
    # Revised function to calculate the mean of two strings of comma-separated numbers
    mean_of_strings <- function(str1, str2) {
      # Convert strings of numbers to numeric vectors
      vec1 <- as.numeric(strsplit(str1, split = ",")[[1]])
      vec2 <- as.numeric(strsplit(str2, split = ",")[[1]])
      # Handle cases where vec1 or vec2 is NULL (i.e., when str1 or str2 is an empty string)
      if (is.null(vec1)) {
        vec1 <- numeric(0)
      }
      if (is.null(vec2)) {
        vec2 <- numeric(0)
      }
      # Compute the mean of the two vectors
      return(round(mean(c(vec1, vec2), na.rm = TRUE)))
    }
    # Apply the function to each row of the dataframe
    data$Avg_Motif_Distance_To_TSS <- mapply(mean_of_strings, data$Motif_GRGGC_Center_Positions, data$Motif_GCCYR_Center_Positions)
    
    # Assign classes based on the given rules
    #change the class name as string "Class 1", "Class 2" ... in the following code.
    data <- data %>%
      mutate(
        Class = case_when(
          Total_Motifs < 50 & abs(Avg_Motif_Distance_To_TSS) <= 1000 ~ "Total_Motifs < 50 & abs(Avg_Motif_Distance_To_TSS) <= 1000",
          Total_Motifs < 50 & abs(Avg_Motif_Distance_To_TSS) > 1000 ~ "Total_Motifs < 50 & abs(Avg_Motif_Distance_To_TSS) > 1000",
          Total_Motifs >= 50 & Total_Motifs < 100 & abs(Avg_Motif_Distance_To_TSS) <= 1000 ~ "50 <= Total_Motifs < 100 & abs(Avg_Motif_Distance_To_TSS) <= 1000",
          Total_Motifs >= 50 & Total_Motifs < 100 & abs(Avg_Motif_Distance_To_TSS) > 1000 ~ "50 <= Total_Motifs < 100 & abs(Avg_Motif_Distance_To_TSS) > 1000",
          Total_Motifs >= 100 & Total_Motifs < 150 & abs(Avg_Motif_Distance_To_TSS) <= 1000 ~ "100 <= Total_Motifs < 150 & abs(Avg_Motif_Distance_To_TSS) <= 1000",
          Total_Motifs >= 100 & Total_Motifs < 150 & abs(Avg_Motif_Distance_To_TSS) > 1000 ~ "100 <= Total_Motifs < 150 & abs(Avg_Motif_Distance_To_TSS) > 1000"
        )
      )
    
    write.table(data, file="motifs_in_promoters_with_LT_peak_Category.txt", sep="\t", row.names = FALSE)  
    #~/Tools/csv2xls-0.4/csv_to_xls.py motifs_in_promoters_with_LT_peak_Category.txt -d$'\t' -o  motifs_in_promoters_with_LT_peak_Category.xls;

Exploring Integrative Analysis of Multi-Omics Data from Public Repositories

  1. Integration and analysis of multi-omics data: Explore the integration of different types of omics data (e.g., genomics, transcriptomics, proteomics) from public repositories to uncover novel biological insights and identify potential biomarkers or therapeutic targets.

  2. Comparative genomics and evolution: Utilize publicly available genomic data to study the evolution of specific gene families or regulatory elements across different species. Investigate the functional implications of genomic variations and their impact on phenotype.

  3. Network analysis of biological systems: Build and analyze biological networks using public repository data, such as protein-protein interaction networks or gene regulatory networks, to identify key nodes or modules associated with specific diseases or biological processes.

  4. Machine learning and predictive modeling: Apply machine learning algorithms to public repository data to develop predictive models for disease diagnosis, drug response prediction, or patient outcome prognosis. Explore the potential of deep learning techniques for analyzing large-scale biological datasets.

  5. Functional annotation and pathway analysis: Use publicly available functional annotation databases and pathway information to annotate and interpret genomic or transcriptomic data. Identify enriched biological pathways or functional modules associated with specific conditions or treatments.

  6. Metagenomics and microbiome analysis: Analyze publicly available metagenomic and microbiome data to investigate microbial diversity, community dynamics, and functional potential in different environments or disease states. Explore the role of the microbiome in human health and disease.

  7. Drug repurposing and target identification: Utilize public repository data, including drug databases and gene expression profiles, to identify potential drug candidates for repurposing and discover new therapeutic targets for specific diseases.

RNAseq processing (1457)

  1. construct DESeqDataSet from Matrix

    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library("org.Hs.eg.db")
    library(DESeq2)
    library(gplots)
    library(ggplot2)
    library(ggrepel)
    setwd("/home/jhuang/DATA/Data_Samira_RNAseq/results/featureCounts/")
    
    #cut -f2- merged_gene_counts.txt > merged_gene_counts_2.txt
    d.raw<- read.delim2("merged_gene_counts.txt",sep="\t", header=TRUE, row.names=1)
    colnames(d.raw)<- c("gene_name","TCR9_r3","X1457_r1","mock_r3","M2_r2","delta9H_r1","delta9H_r3","M2_r3","X1457_r2","M10_r3","TCR9_r1","M1_r1","TCR9_r2","M1_r2","X1457_r3","M10_r2","M2_r1","mock_r2","delta9H_r2","M1_r3","mock_r1","M10_r1")
    col_order <- c("gene_name","mock_r1","mock_r2","mock_r3","M1_r1","M1_r2","M1_r3","M2_r1","M2_r2","M2_r3","M10_r1","M10_r2","M10_r3","X1457_r1","X1457_r2","X1457_r3","TCR9_r1","TCR9_r2","TCR9_r3","delta9H_r1","delta9H_r2","delta9H_r3")
    reordered.raw <- d.raw[,col_order]
    reordered.raw$gene_name <- NULL
    
    #NOTE that we should d instead of d.raw!!!!!!
    d <- reordered.raw[rowSums(reordered.raw>3)>2,]
    
    condition = as.factor(c("mock","mock","mock","M1","M1","M1","M2","M2","M2","M10","M10","M10","X1457","X1457","X1457","TCR9","TCR9","TCR9","delta9H","delta9H","delta9H"))
    ids = as.factor(c("mock_r1","mock_r2","mock_r3","M1_r1","M1_r2","M1_r3","M2_r1","M2_r2","M2_r3","M10_r1","M10_r2","M10_r3","X1457_r1","X1457_r2","X1457_r3","TCR9_r1","TCR9_r2","TCR9_r3","delta9H_r1","delta9H_r2","delta9H_r3"))
    replicate = as.factor(c("r1","r2","r3","r1","r2","r3","r1","r2","r3","r1","r2","r3","r1","r2","r3","r1","r2","r3","r1","r2","r3"))
    
    #Note that we need d
    cData = data.frame(row.names=colnames(d), condition=condition, replicate=replicate, ids=ids)
    dds<-DESeqDataSetFromMatrix(countData=d, colData=cData, design=~condition)  #batch+
    
    > sizeFactors(dds)
    NULL
    > dds <- estimateSizeFactors(dds)
    > sizeFactors(dds)
       mock_r1    mock_r2    mock_r3      M1_r1      M1_r2      M1_r3      M2_r1 
     1.1510294  1.0108629  1.2046637  0.9219507  1.2912217  1.0233951  1.1781932 
         M2_r2      M2_r3     M10_r1     M10_r2     M10_r3   X1457_r1   X1457_r2 
     1.0286656  1.1274057  0.8521032  0.9723604  0.7937256  0.8869522  1.0276279 
      X1457_r3    TCR9_r1    TCR9_r2    TCR9_r3 delta9H_r1 delta9H_r2 delta9H_r3 
     0.8798504  1.4702299  0.9617160  1.2175588  0.7935592  0.8016998  1.0166897 
    
    bamCoverage --bam ./STAR/V_8_2_4_p600_d3_DonorIAligned.sortedByCoord.out.bam -o bigWigs/ctrl_d3_DonorI_norm.bw --binSize 10 --scaleFactor 0.8182619037059573 --effectiveGenomeSize 2913022398
    bamCoverage --bam ./STAR/V_8_2_3_p600_d3_DonorIIAligned.sortedByCoord.out.bam -o bigWigs/ctrl_d3_DonorII_norm.bw --binSize 10 --scaleFactor 1.230524791752137 --effectiveGenomeSize 2913022398
    bamCoverage --bam ./STAR/V_8_2_4_p600_d8_DonorIAligned.sortedByCoord.out.bam -o bigWigs/ctrl_d8_DonorI_norm.bw --binSize 10 --scaleFactor 0.9406161731990421 --effectiveGenomeSize 2913022398
    bamCoverage --bam ./STAR/V_8_2_3_p600_d8_DonorIIAligned.sortedByCoord.out.bam -o bigWigs/ctrl_d8_DonorII_norm.bw --binSize 10 --scaleFactor 1.5944164810367278 --effectiveGenomeSize 2913022398
    
    bamCoverage --bam ./STAR/V_8_4_2_p602_d3_DonorIAligned.sortedByCoord.out.bam -o bigWigs/LT_d3_DonorI_norm.bw --binSize 10 --scaleFactor 1.1691469144166922 --effectiveGenomeSize 2913022398
    bamCoverage --bam ./STAR/V_8_4_2_p602_d3_DonorIIAligned.sortedByCoord.out.bam -o bigWigs/LT_d3_DonorII_norm.bw --binSize 10 --scaleFactor 0.9627956504743693 --effectiveGenomeSize 2913022398
    bamCoverage --bam ./STAR/V_8_4_2_p602_d8_DonorIAligned.sortedByCoord.out.bam -o bigWigs/LT_d8_DonorI_norm.bw --binSize 10 --scaleFactor 0.9685710322875091 --effectiveGenomeSize 2913022398
    bamCoverage --bam ./STAR/V_8_4_1_p602_d8_DonorIIAligned.sortedByCoord.out.bam -o bigWigs/LT_d8_DonorII_norm.bw --binSize 10 --scaleFactor 0.7369838699288324 --effectiveGenomeSize 2913022398
    
    bamCoverage --bam ./STAR/V_8_2_4_p605_d3_DonorIAligned.sortedByCoord.out.bam -o bigWigs/LTtr_d3_DonorI_norm.bw --binSize 10 --scaleFactor 0.7650745897995206 --effectiveGenomeSize 2913022398
    bamCoverage --bam ./STAR/V_8_2_3_p605_d3_DonorIIAligned.sortedByCoord.out.bam -o bigWigs/LTtr_d3_DonorII_norm.bw --binSize 10 --scaleFactor 1.2072732417906324 --effectiveGenomeSize 2913022398
    bamCoverage --bam ./STAR/V_8_2_4_p605_d8_DonorIAligned.sortedByCoord.out.bam -o bigWigs/LTtr_d8_DonorI_norm.bw --binSize 10 --scaleFactor 0.617050461769713 --effectiveGenomeSize 2913022398
    bamCoverage --bam ./STAR/V_8_2_3_p605_d8_DonorIIAligned.sortedByCoord.out.bam -o bigWigs/LTtr_d8_DonorII_norm.bw --binSize 10 --scaleFactor 1.1972841763570858 --effectiveGenomeSize 2913022398
    
    bamCoverage --bam ./STAR/K331A_DonorIAligned.sortedByCoord.out.bam -o bigWigs/K331A_DonorI_norm.bw --binSize 10 --scaleFactor 0.5914211756222816 --effectiveGenomeSize 2913022398
    bamCoverage --bam ./STAR/K331A_DonorIIAligned.sortedByCoord.out.bam -o bigWigs/K331A_DonorII_norm.bw --binSize 10 --scaleFactor 1.6631219993121327 --effectiveGenomeSize 2913022398
    
    rld <- rlogTransformation(dds)
  2. PCA plots pca_1457

    library(ggplot2)
    svg("pca6.svg")
    data <- plotPCA(rld, intgroup=c("condition", "replicate"), returnData=TRUE)
    percentVar <- round(100 * attr(data, "percentVar"))
    ggplot(data, aes(PC1, PC2, color=condition, shape=replicate)) +
      geom_point(size=3) +
      scale_color_manual(values = c("mock" = "grey",
                                    "M1"="#1f78b4")) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      coord_fixed()
    dev.off()
    
    png("pca.png", width=800, height=800)
    data <- plotPCA(rld, intgroup=c("condition", "replicate"), returnData=TRUE)
    percentVar <- round(100 * attr(data, "percentVar"))
    ggplot(data, aes(PC1, PC2, color=condition, shape=replicate)) +
      geom_point(size=3) +
      scale_color_manual(values = c("mock" = "darkgrey",
                                    "M1"="#a14a1a", "M2"="#33a02c", "M10"="#1f78b4", "X1457"="#e31a1c", "TCR9"="orange", "delta9H"="purple")) +
      xlab(paste0("PC1: ",percentVar[1],"% variance")) +
      ylab(paste0("PC2: ",percentVar[2],"% variance")) +
      coord_fixed()
    dev.off()
    
    png("pca2.png", 1200, 800)
    plotPCA(rld, intgroup=c("condition"))
    dev.off()
    
    data <- plotPCA(rld, intgroup=c("condition", "donor"), returnData=TRUE)
    write.csv(data, file="plotPCA_data.csv")
    #calculate all PCs including PC3 with the following codes
    library(genefilter)
    ntop <- 500
    rv <- rowVars(assay(rld))
    select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
    mat <- t( assay(rld)[select, ] )
    pc <- prcomp(mat)
    pc$x[,1:3]
    #df_pc <- data.frame(pc$x[,1:3])
    df_pc <- data.frame(pc$x)
    identical(rownames(data), rownames(df_pc)) #-->TRUE
    ## define the desired order of row names
    #desired_order <- rownames(data)
    ## sort the data frame by the desired order of row names
    #df <- df[match(desired_order, rownames(df_pc)), ]
    data$PC1 <- NULL
    data$PC2 <- NULL
    merged_df <- merge(data, df_pc, by = "row.names")
    #merged_df <- merged_df[, -1]
    row.names(merged_df) <- merged_df$Row.names
    merged_df$Row.names <- NULL  # remove the "name" column
    merged_df$name <- NULL
    merged_df <- merged_df[, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","group","condition","donor")]
    #results in 26PCs: merged_df <- merged_df[, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","group","condition","donor")]
    write.csv(merged_df, file="merged_df_8PCs.csv")
    #> summary(pc)    #0.5657 0.2195 0.1347  --> 0.57 0.22 0.13
    #Importance of components:
    #                           PC1     PC2    PC3    PC4     PC5     PC6     PC7
    #Standard deviation     19.4051 12.0878 9.4683 4.5569 4.01016 3.00610 2.71918
    #Proportion of Variance  0.5657  0.2195 0.1347 0.0312 0.02416 0.01358 0.01111
    #Cumulative Proportion   0.5657  0.7853 0.9200 0.9512 0.97531 0.98889 1.00000
  3. volcano plots X1457_vs_mock

    dds$condition <- relevel(dds$condition, "mock")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("M1_vs_mock","M2_vs_mock","M10_vs_mock", "delta9H_vs_mock","TCR9_vs_mock","X1457_vs_mock")
    
    dds$condition <- relevel(dds$condition, "X1457")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("M10_vs_X1457", "delta9H_vs_X1457")
    
    library(biomaRt)
    listEnsembl()
    listMarts()
    ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version="104")
    datasets <- listDatasets(ensembl)
    
    listEnsemblArchives()
    attributes = listAttributes(ensembl)
    attributes[1:25,]
    
    library(dplyr)
    library(tidyverse)
    
    for (i in clist) {
    #i<-clist[1]
    #i<-"M1_vs_mock"
      contrast = paste("condition", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("ENTREZID","EXONID","GENEBIOTYPE","GENEID","GENENAME","PROTEINDOMAINSOURCE","PROTEINID","SEQNAME","SEQSTRAND","SYMBOL","TXBIOTYPE","TXID","TXNAME","UNIPROTID"))
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "ENTREZID", "SYMBOL", "GENENAME","GENEBIOTYPE","TXBIOTYPE","SEQSTRAND","UNIPROTID"))
      # In the ENSEMBL-database, GENEID is ENSEMBL-ID.
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "SYMBOL", "GENEBIOTYPE"))  #  "ENTREZID", "TXID","TXBIOTYPE","TXSEQSTART","TXSEQEND"
      #geness <- geness[!duplicated(geness$GENEID), ]
    
      #using getBM replacing AnnotationDbi::select
      #filters = 'ensembl_gene_id' means the records should always have a valid ensembl_gene_ids.
      geness <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
          filters = 'ensembl_gene_id',
          values = rownames(res), 
          mart = ensembl)
      geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)
    
      #merge by column by common colunmn name, in the case "GENEID"
      res$ENSEMBL = rownames(res)
      identical(rownames(res), rownames(geness_uniq))
      res_df <- as.data.frame(res)
      geness_res <- merge(geness_uniq, res_df, by.x="ensembl_gene_id", by.y="ENSEMBL")
      dim(geness_res)
      rownames(geness_res) <- geness_res$ensembl_gene_id
      geness_res$ensembl_gene_id <- NULL
      write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(geness_res, padj<=0.05 & log2FoldChange>=2)
      down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
    #for i in "M1_vs_mock" "M2_vs_mock" "M10_vs_mock" "delta9H_vs_mock" "TCR9_vs_mock" "X1457_vs_mock"; do
    for i in "M10_vs_X1457" "delta9H_vs_X1457"; do
      # read files to geness_res
      echo "geness_res <- read.csv(file = paste(\"${i}\", \"all.txt\", sep=\"-\"), row.names=1)"
    
      echo "subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res\$log2FoldChange) >= 2.0))"
      echo "geness_res\$Color <- \"NS or log2FC < 2.0\""
      echo "geness_res\$Color[geness_res\$pvalue < 0.05] <- \"P < 0.05\""
      echo "geness_res\$Color[geness_res\$padj < 0.05] <- \"P-adj < 0.05\""
      echo "geness_res\$Color[abs(geness_res\$log2FoldChange) < 2.0] <- \"NS or log2FC < 2.0\""
      echo "write.csv(geness_res, \"${i}_with_Category.csv\")"
    
      # pick top genes for either side of volcano to label
      # order genes for convenience:
      echo "geness_res\$invert_P <- (-log10(geness_res\$pvalue)) * sign(geness_res\$log2FoldChange)"
      echo "top_g <- c()"
      echo "top_g <- c(top_g, \
                 geness_res[, 'external_gene_name'][order(geness_res[, 'invert_P'], decreasing = TRUE)[1:100]], \
                 geness_res[, 'external_gene_name'][order(geness_res[, 'invert_P'], decreasing = FALSE)[1:100]])"
      echo "top_g <- unique(top_g)"
      echo "geness_res <- geness_res[, -1*ncol(geness_res)]"  # remove invert_P from matrix
    
      # Graph results
      echo "png(\"${i}.png\",width=1200, height=2000)"
      echo "ggplot(geness_res, \
          aes(x = log2FoldChange, y = -log10(pvalue), \
              color = Color, label = external_gene_name)) + \
          geom_vline(xintercept = c(2.0, -2.0), lty = \"dashed\") + \
          geom_hline(yintercept = -log10(0.05), lty = \"dashed\") + \
          geom_point() + \
          labs(x = \"log2(FC)\", y = \"Significance, -log10(P)\", color = \"Significance\") + \
          scale_color_manual(values = c(\"P < 0.05\"=\"darkgray\",\"P-adj < 0.05\"=\"red\",\"lysosomal\"=\"lightblue\",\"TFEB\"=\"green\",\"lysosomal & TFEB\"=\"cyan\",\"NS or log2FC < 2.0\"=\"gray\"),guide = guide_legend(override.aes = list(size = 4))) + scale_y_continuous(expand = expansion(mult = c(0,0.05))) + \
          geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res\$log2FoldChange) >= 2.0)), size = 4, point.padding = 0.15, color = \"black\", min.segment.length = .1, box.padding = .2, lwd = 2) + \
          theme_bw(base_size = 16) + \
          theme(legend.position = \"bottom\")"
      echo "dev.off()"
    done
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    M1_vs_mock-all.txt \
    M1_vs_mock-up.txt \
    M1_vs_mock-down.txt \
    -d$',' -o M1_vs_mock.xls;
  4. clustering the genes and draw heatmap DEGs_heatmap_1457

    up1 <-read.csv(file="./M1_vs_mock-up.txt", header = TRUE, row.names = 1)
    up2 <-read.csv(file="./M2_vs_mock-up.txt", header = TRUE, row.names = 1)
    up3 <-read.csv(file="./M10_vs_mock-up.txt", header = TRUE, row.names = 1)
    up4 <-read.csv(file="./delta9H_vs_mock-up.txt", header = TRUE, row.names = 1)
    up5 <-read.csv(file="./TCR9_vs_mock-up.txt", header = TRUE, row.names = 1)
    up6 <-read.csv(file="./X1457_vs_mock-up.txt", header = TRUE, row.names = 1)
    
    down1 <-read.csv(file="./M1_vs_mock-down.txt", header = TRUE, row.names = 1)
    down2 <-read.csv(file="./M2_vs_mock-down.txt", header = TRUE, row.names = 1)
    down3 <-read.csv(file="./M10_vs_mock-down.txt", header = TRUE, row.names = 1)
    down4 <-read.csv(file="./delta9H_vs_mock-down.txt", header = TRUE, row.names = 1)
    down5 <-read.csv(file="./TCR9_vs_mock-down.txt", header = TRUE, row.names = 1)
    down6 <-read.csv(file="./X1457_vs_mock-down.txt", header = TRUE, row.names = 1)
    
    RNASeq.NoCellLine <- assay(rld)
    #Option3: as paper described, A heatmap showing expression values of all DEGs which are significant between any pair conditions.
    all_genes <- c(rownames(up1),rownames(down1),  rownames(up2),rownames(down2), rownames(up3),rownames(down3), rownames(up4),rownames(down4), rownames(up5),rownames(down5), rownames(up6),rownames(down6))  #5473
    all_genes <- unique(all_genes)   #2971
    RNASeq.NoCellLine_  <- RNASeq.NoCellLine[all_genes,]
    write.csv(as.data.frame(RNASeq.NoCellLine_), file ="significant_gene_expressions.txt")
    
    #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
    datamat = RNASeq.NoCellLine_
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.08)
    mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
    
    mycol = mycol[as.vector(mycl)]
    #sampleCols <- rep('GREY',ncol(RNASeq.NoCellLine_))
    #names(sampleCols) <- c("mock_r1", "mock_r2", "mock_r3", "mock_r4", "WAP_r1", "WAP_r2",  "WAP_r3", "WAP_r4", "WAC_r1","WAC_r2")
    #sampleCols[substr(colnames(RNASeq.NoCellLine_),1,4)=='mock'] <- 'GREY'
    #sampleCols[substr(colnames(RNASeq.NoCellLine_),1,3)=='WAP'] <- 'RED'
    #sampleCols[substr(colnames(RNASeq.NoCellLine_),1,3)=='dM_'] <- 'CYAN'
    #sampleCols[substr(colnames(RNASeq.NoCellLine_),1,3)=='dP_'] <- 'BLUE'
    #sampleCols[substr(colnames(RNASeq.NoCellLine_),1,3)=='WAC'] <- 'GREEN'
    png("DEGs_heatmap.png", width=900, height=1010)
    heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
                scale='row',trace='none',col=bluered(75), 
                RowSideColors = mycol, labRow="", srtCol=30, keysize=0.72, cexRow = 2, cexCol = 1.4)
    dev.off()
    
    #### cluster members #####
    write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
    write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt') 
    write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')  
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    cluster1_YELLOW.txt \
    cluster2_DARKBLUE.txt \
    cluster3_DARKORANGE.txt \
    -d$',' -o gene_culsters.xls;
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    significant_gene_expressions.txt \
    -d',' -o DEGs_heatmap_data.xls;

Tools for SNP Visualization

For SNP visualization in Python, we can consider using the following packages:

  • Matplotlib: Matplotlib is a popular plotting library in Python that can be used to create various types of visualizations, including chromosome plots. You can plot SNPs as points or markers on the chromosomes using Matplotlib’s scatter or plot functions.

  • Bokeh: Bokeh is a powerful interactive visualization library in Python. It provides tools for creating interactive plots and offers features like zooming, panning, and tooltips. You can use Bokeh to create chromosome plots and represent SNPs as interactive data points.

  • PyGenomeTracks: PyGenomeTracks is a Python package specifically designed for visualizing genomic data. It provides a convenient interface for plotting genomic features, including SNPs, on chromosomes. It offers customization options for styling and annotating the plots.

In R, we can consider using the following packages for SNP visualization:

  • Gviz: Gviz is an R package that allows you to create interactive and customizable genome visualizations. It provides functions for plotting SNPs on chromosomes and offers features like zooming, panning, and linking to external data sources.

  • karyoploteR: karyoploteR is an R package designed for high-quality karyotype and genomic region plots. It provides functions for visualizing SNPs on chromosomes, allowing you to customize colors, sizes, and other plot attributes.

  • ggplot2: ggplot2 is a widely used plotting package in R that offers a grammar of graphics approach for creating various types of plots. You can use ggplot2 to create SNP plots on chromosomes, leveraging its flexibility and customization options.

Both Python and R provide a wide range of plotting and visualization packages, and the choice depends on your familiarity with the programming language and specific requirements for SNP visualization.

Calculate AT content of the flanking regions of peaks

#!/usr/bin/env python3
#python3 plot_peaks.py peaks.bed /home/jhuang/REFs/gencode.v43.annotation.gtf.db /ref/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa
#~/Tools/csv2xls-0.4/csv_to_xls.py _peaks_and_motifs_in_promoters_with_LT_peak.txt peak_tss_distances_.txt -d$'\t' -o peaks_and_motifs_in_promoters_with_LT_peak.xls;
#rename the name of sheet1 to "peaks_and_motifs_in_promoters_with_LT_peak" and "all_LT_peaks_NHDF" 
#mv peak_tss_distances_.xlsx all_LT_peaks_NHDF.xlsx

import pprint
import argparse
import matplotlib.pyplot as plt
import pandas as pd
import gffutils
import pyfaidx
import numpy as np

def calculate_at_content(seq):
    # Function to calculate the AT-content of a given sequence
    at_count = seq.count('A') + seq.count('T')
    total_count = len(seq)
    at_content = round((at_count / total_count) * 100, 1)
    return at_content

def plot_peak_distribution(peaks_file, gencode_db, fasta_file, upstream_length=500, downstream_length=500):
    db = gffutils.FeatureDB(gencode_db, keep_order=True)
    peaks = pd.read_table(peaks_file, header=None)
    peaks.columns = ['chrom', 'start', 'end', 'name', 'score']
    tss_positions = []
    tss_genes = []
    for gene in db.features_of_type('gene'):
        if gene.strand == '+':
            tss_positions.append(gene.start)
        else:
            tss_positions.append(gene.end)
        tss_genes.append(gene)
    closest_tss_indices = []
    peak_tss_distances = []
    peak_names = []
    closest_genes = []
    closest_gene_names = []
    closest_strands = []
    closest_gene_types = []  # New list for gene_type
    start_positions = []  # New list for Start Positions
    end_positions = []  # New list for End Positions
    start_distance_to_tss = []  # New list for Distance to TSS of Start Positions
    end_distance_to_tss = []  # New list for Distance to TSS of End Positions
    peak_starts = []
    peak_ends = []

    # Load the FASTA file using pyfaidx
    fasta = pyfaidx.Fasta(fasta_file)

    # New lists for AT-content
    start_flank_seqs = []
    end_flank_seqs = []
    start_at_content = []
    end_at_content = []

    for _, peak in peaks.iterrows():
        peak_center = (peak['start'] + peak['end']) // 2
        peak_start = peak['start']
        peak_end = peak['end']
        closest_tss_index = np.argmin([abs(tss - peak_center) for tss in tss_positions])
        closest_tss_indices.append(closest_tss_index)
        distance_to_tss = peak_center - tss_positions[closest_tss_index]
        peak_tss_distances.append(distance_to_tss)
        peak_names.append(peak['name'])
        closest_genes.append(tss_genes[closest_tss_index].id)
        if 'gene_name' in tss_genes[closest_tss_index].attributes:
            closest_gene_name = tss_genes[closest_tss_index].attributes['gene_name'][0]
        else:
            closest_gene_name = 'N/A'  # Set a default value if 'gene_name' attribute is not present
        closest_gene_names.append(closest_gene_name)
        closest_strands.append(tss_genes[closest_tss_index].strand)

        if 'gene_type' in tss_genes[closest_tss_index].attributes:
            closest_gene_type = tss_genes[closest_tss_index].attributes['gene_type'][0]
        else:
            closest_gene_type = 'N/A'  # Set a default value if 'gene_type' attribute is not present
        closest_gene_types.append(closest_gene_type)

        #start_distance_to_tss.append(peak_start - tss_positions[closest_tss_index])  # Add Distance to TSS of Start Position to the list
        #end_distance_to_tss.append(peak_end - tss_positions[closest_tss_index])  # Add Distance to TSS of End Position to the list

        peak_starts.append(peak_start)
        peak_ends.append(peak_end)

        # Get the flanking region coordinates
        start_flank_start = peak_start - upstream_length
        start_flank_end = peak_start - 1
        end_flank_start = peak_end + 1
        end_flank_end = peak_end + downstream_length

        # Retrieve the flanking region sequences from the FASTA file
        start_flank_seq = fasta[peak['chrom']][start_flank_start:start_flank_end].seq.upper()
        end_flank_seq = fasta[peak['chrom']][end_flank_start:end_flank_end].seq.upper()
        start_flank_seqs.append(start_flank_seq)
        end_flank_seqs.append(end_flank_seq)

        # # Calculate the AT-content of the flanking regions based on the gene strand
        # if tss_genes[closest_tss_index].strand== '+':
        #     start_at = calculate_at_content(start_flank_seq)
        #     end_at = calculate_at_content(end_flank_seq)
        # else:
        #     start_at = calculate_at_content(end_flank_seq)
        #     end_at = calculate_at_content(start_flank_seq)
        start_at = calculate_at_content(start_flank_seq)
        end_at = calculate_at_content(end_flank_seq)
        start_at_content.append(start_at)
        end_at_content.append(end_at)

    df = pd.DataFrame({
        'Peak Name': peak_names,
        'Peak Start': peak_starts,
        'Peak End': peak_ends,
        #'Closest TSS Index': closest_tss_indices,
        'Distance to TSS': peak_tss_distances,
        'Closest Gene (Ensemble ID)': closest_genes,
        'Closest Gene (Gene name)': closest_gene_names,
        'Gene Strand': closest_strands,
        'Gene Type': closest_gene_types,  # Add new column for gene_type
        #'start_distance_to_tss': start_distance_to_tss,
        #'end_distance_to_tss': end_distance_to_tss,
        'Start-Flanking AT-content (500 nt)': start_at_content,
        'End-Flanking AT-content (500 nt)': end_at_content,
        'Start-Flanking Sequence (500 nt)': start_flank_seqs,
        'End-Flanking Sequence (500 nt)': end_flank_seqs
    })

    df.to_excel('peak_tss_distances_.xlsx', index=False)
    df.to_csv('peak_tss_distances_.txt', sep='\t', index=False)  # Save as tab-separated text file
    counts, bin_edges = np.histogram(peak_tss_distances, bins=1000)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    hist_df = pd.DataFrame({'Bin Center': bin_centers, 'Count': counts})
    hist_df.to_csv('peak_distribution_.csv', index=False)
    total_peaks = hist_df['Count'].sum()
    with open('peak_distribution_.csv', 'a') as f:
        f.write(f'\nTotal number of peaks: {total_peaks}')
    plt.hist(peak_tss_distances, bins=1000)
    plt.xlabel('Distance to TSS')
    plt.ylabel('Number of peaks')
    plt.title('Distribution of peaks around TSS')
    plt.savefig('peak_distribution_.png')
    plt.show()

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Plot peak distribution around TSS.')
    parser.add_argument('peaks_file', type=str, help='Path to the peaks.bed file')
    parser.add_argument('gencode_db', type=str, help='Path to the gencode.v43.annotation.db file')
    parser.add_argument('fasta_file', type=str, help='Path to the FASTA file')
    args = parser.parse_args()
    plot_peak_distribution(args.peaks_file, args.gencode_db, args.fasta_file)

梅尔克细胞病毒小抗原通过类型I干扰素信号传导而促进免疫逃逸

详细解释:梅尔克细胞多瘤病毒(Merkel cell polyomavirus,MCPyV)的小肿瘤抗原(small tumor antigen,sT)是该病毒的一个重要组分,通过干扰类型I干扰素(type I interferon)信号传导来促进免疫逃逸。

类型I干扰素是免疫系统中的一类重要信号分子,对抗病毒感染和肿瘤发展起着关键作用。然而,MCPyV的小肿瘤抗原通过干扰类型I干扰素信号的正常传导,从而导致免疫系统无法有效识别和清除感染细胞。

具体而言,小肿瘤抗原通过抑制关键分子的表达,如干扰素调节因子9(interferon regulatory factor 9,IRF9)和信号转导和转录激活因子1(signal transducer and activator of transcription 1,STAT1),来干扰类型I干扰素所引发的免疫应答。这种干扰导致了干扰素诱导基因(interferon-stimulated genes,ISGs)的抑制,从而削弱了宿主细胞对病毒感染的免疫反应。

这些发现对于理解梅尔克细胞多瘤病毒的感染机制和免疫逃逸至关重要。进一步研究小肿瘤抗原如何干扰类型I干扰素信号传导以及相关的免疫调节途径,有助于揭示肿瘤发展的分子机制,并为开发针对梅尔克细胞多瘤病毒感染的治疗策略提供新的目标和思路。

副标题:sT介导的TBK1-IRF3信号传导调控对MCPyV感染和发病机制的重要性

关键词:sT,病毒致癌蛋白,免疫逃逸,TBK1-IRF3

摘要

梅尔克细胞多瘤病毒(MCPyV)是导致大多数梅尔克细胞癌(MCC)的病因。MCPyV的编码能力有限,早期病毒蛋白大T(LT)和小T(sT)具有多种功能,对感染和转化起重要作用。感染和肿瘤发生之间早期病毒基因表达存在一个根本差异,肿瘤中表达截断的LT(LTtr),而sT在两种情况下都有表达,并促进肿瘤发生。本研究旨在识别早期病毒蛋白引发的感染和发病关键机制。我们利用原代人类细胞进行全基因组转录组和染色质研究。由于目前在感染和肿瘤发生模型方面的限制,我们通过过表达早期病毒蛋白的实验来模拟这些情况,可以单独进行实验或组合进行实验。

除了调节细胞周期和炎症过程外,我们还揭示了sT的一个重要新功能:下调参与识别和调控先天免疫系统的基因。我们发现,sT在TANK结合激酶1(TBK1)水平上降低了类型I干扰素(IFN)反应,从而有效减少了TBK1依赖的IFN-beta启动子的激活,但对IFN-beta诱导的ISRE元件的激活没有影响。同时,我们还发现LT和LTtr可以激活类型I干扰素反应,而sT则起到了抵消这一功能的作用。我们首次提供了由早期病毒蛋白引发和调控的类型I干扰素激活的机制理解。我们假设TBK1-IRF3调控是一种由sT修饰的重要机制,对感染、持续感染和肿瘤微环境调控具有影响。

引言

人类多聚病毒(hPyVs)具有狭窄的宿主范围和高度的细胞类型限制。因此,目前的体外细胞培养系统受到限制,而用于研究病毒生命周期和/或持续感染的体内动物模型不存在[1-5]。hPyVs可终身持续感染,然而,对于包括BKPyV、JCPyV和MCPyV在内的大多数hPyVs,它们在自然宿主中的持续感染部位/细胞类型尚未完全揭示,导致持续感染建立的机制也仅了解甚少[1, 3, 5-9]。尽管目前的持续感染模型认为,BKPyV虽然能感染多种细胞类型,但仅在具有受控病毒DNA复制、病毒后代产生和细胞再感染的内源性免疫应答细胞中建立持续感染[7],MCPyV则被认为能独立于病毒后代产生建立持续感染形式[3, 5]。

DNA病毒作为入侵病原体的第一道防线被先天免疫系统所识别。因此,hPyVs在入侵过程中被内源性体内的Toll样受体(TLRs)如TLR9在内体袋中识别,TLR9能够识别CpG未甲基化的双链DNA[10, 11]。此外,细胞质核酸传感器也参与对DNA病毒感染的早期识别。细胞质DNA识别的主要调节因子是干扰素基因刺激因子(STING),其受到cGAS、AIM2或IFI16等多种DNA传感器的调控[12]。

对于MCPyV感染和免疫调节的了解非常有限,只有少数几篇研究报告,主要是基于过表达研究。例如,LT的表达降低了上皮细胞和MCC细胞系中的TLR-9表达[11]。BJ-5ta细胞中LT和sT的表达增加了调节细胞生长和细胞运动的基因表达[13]。与此同时,还诱导了多种炎症反应基因的表达,包括干扰素刺激基因(ISGs)如OAS1和ISG20,细胞因子如IL-1β和IL-6,以及趋化因子如CXCL1和CXCL6 [13]。在缺乏LT的情况下,sT的表达通过与PP4C和NEMO的相互作用抑制NF-κB炎症信号传导[14, 15]。

我们应用了一种体外感染模型,将原代皮肤成纤维细胞感染MCPyV [16],观察到在感染的后期阶段cGAS-STING的激活以及NF-κB信号通路的诱导,随后表达了抗病毒的ISGs和内源性炎症细胞因子。有趣的是,在感染的早期阶段并未检测到cGAS/STING的激活[17, 18]。然而,在MCC细胞系中也观察到STING的激活和其逃避机制,MCPyV阳性的MCC细胞系中STING的表达明显降低[17, 18]。此外,MCC肿瘤的HLA-I表达较低,通过给予组蛋白去乙酰化酶抑制剂[19-21]或减少sT的表达[22]可以逆转MCC细胞系中的HLA-I表达。

本研究系统分析了早期病毒基因产物诱导的转录和表观遗传学变化。我们通过在原代细胞中进行过表达分析,并在感染和发病过程中不同时间点进行全基因组RNASeq和ChIPSeq实验,模拟早期病毒区域的病毒基因表达以及在感染和发病过程中对宿主基因表达的调控。因此,我们发现了sT在MCPyV感染和发病过程中的新功能,即sT在很大程度上参与免疫逃逸。有趣的是,sT是一种拮抗剂,可以通过在胞浆中抑制TBK1信号级联来降低由LT或肿瘤表达的LTtr触发的免疫活化。我们的研究结果对于理解MCPyV的感染和持续感染以及MCC的发病机制具有重要意义,肿瘤细胞在肿瘤浸润性T细胞中的识别可能会受到肿瘤微环境中TBK-1/IRF3免疫活化的抑制。

结果

  • MCPyV T抗原的过表达导致nHDFs转录谱的扰动

    为了研究MCPyV T抗原在相关的人类细胞类型中引发的宿主转录和组蛋白修饰的变化,我们使用了之前发表的方法[16]对新生儿皮肤真皮成纤维细胞(nHDFs)进行了MCPyV颗粒的感染。我们成功感染了约5-8%的细胞,通过7天后免疫染色中LT和VP1阳性细胞的比例来衡量(suppl.图S1)。然而,感染细胞的相对数量过低,无法解析宿主基因表达的变化,并且没有出现释放新颗粒和再感染等允许性感染的证据,因为感染细胞的频率随时间不变。类似地,半允许性的体外复制系统也无法获得更高比例的感染细胞或病毒传播[4, 5, 23]。

    为了研究早期病毒基因产物LT、LTtr和sT引发的潜在基因表达变化和染色质改变,我们建立了一个过表达协议,利用慢病毒转导,可以在相关细胞类型中单独或组合转导单个T抗原(图1A)。因此,我们可以控制T抗原表达的量以及转导后的时间点,从而模拟具有早期病毒T-Ag表达的可能的初始感染、长期感染或肿瘤细胞表达模式(图1A-B)。

    我们使用两个独立供体的nHDFs研究了由sT、LT或LTtr诱导的转录和组蛋白修饰的变化,通过Western Blotting和RT-qPCR进行了控制(图1B和suppl.图S2A-E)。随后,我们评估了转导后3天和8天(dpt)的即时转录和组蛋白修饰变化,并对转导后9天(供体I)或12天(供体II)进行下游分析(图1A)模拟癌前状态或癌变状态下的基因表达模式。

    使用RNASeq对nHDFs进行的转录组分析显示,MCPyV T抗原在所有时间点均引发了显著的扰动,表现为大量的差异表达基因(DEGs)(图1C、1D、suppl.表S1)。主成分分析(PCA)显示样本在供体间呈现显著的聚类,表明对慢病毒转导和T抗原表达的敏感性存在差异(suppl.图S2F)。然而,在不同的T抗原中,宿主基因表达的明显差异也在供体特异性聚类中表现出来。

    DEG分析包括具有log2FC >1或<-1且padj. <0.05的基因,显示与向量对照相比,几百个基因在所有情况下都有差异调控的趋势,其中表达增加的基因多于表达下降的基因,尤其是在表达LT或LTtr的细胞中(图1C)。LT表达导致985和837个基因在3或8dpt时显著上调。类似地,LTtr也呈现出相似的情况,在3或9dpt时有659和944个基因显著上调,而在转导了sT、sT+LT或sT+LTtr的细胞中,上调的基因数量较少。有趣的是,在表达LTtr的nHDFs(3d)中,仅有相对较少的基因(n=74)呈现显著下调(图1C)。通过比较热图,我们可以看到在所有情况下存在四个显著的DEG簇(图1D)。根据基因本体论(GO)分析,具有相似生物过程的基因被归类到簇I(蓝色),其中包括先天免疫和类型I干扰素(IFN)反应以及NF-kB信号通路(suppl.图S3和suppl.表S2)。其中大多数基因在sT表达下被下调,而在LT或LTtr表达下被上调(图1D)。另一方面,簇II(品红色)包含参与细胞外基质(ECM)组织、细胞凋亡过程和蛋白稳定化的基因。在簇III(橙色)中,包含参与信号转导、炎症和趋化因子信号、化学运动、细胞分裂和细胞黏附的基因被发现富集(suppl.图S3和suppl.表S2)。其中大多数基因在表达sT 8天或与LTtr共同表达9-12天的细胞中高度上调(图1D),尽管它们在表达模式上有变异。同时,簇IV(黄色)显示与簇III相反的表达模式。这一簇中的基因表达调控细胞黏附炎症反应和增殖(suppl.图S3和suppl.表S2)。

  • T抗原在宿主基因表达模式中展示了共同的但也有独特的改变。

    我们分别对每个包含重复样本的数据集进行了分析,以研究各个T抗原共享的基因集和特定靶向的基因(图2)。不同条件下的DEG,以火山图的形式呈现,如suppl.图S4和表S1所示。其中,我们着重强调了在表达sT的细胞与对照组相比的DEG结果(图3A)。我们对每个条件进行了GO分析(生物过程),使用显著(padj. < 0.05)上调(log2FC > 1)或下调(log2FC < -1)的基因集合,并总结在suppl.表S3中。结果经过筛选,选取了最显著的10个GO术语(FDR < 0.05且基因数量至少为10)(图2)。比较每个情况下获得的GO术语,揭示了LT、LTtr和sT调控的生物过程有相当大的重叠(图2),例如细胞增殖和细胞分裂。值得注意的是,尽管我们和其他研究组先前发现完整LT而非LTtr在包括nHDFs在内的多种细胞系中具有一般的生长抑制作用,但在我们的GO分析中,LT和LTtr在细胞周期调控或细胞分裂方面没有明显差异[5, 24]。

    除了细胞周期调控,我们发现DNA复制和DNA修复等GO术语在所有条件下均高度富集,其中LT和LTtr对DNA复制的影响显著高于sT。有趣的是,LT和LTtr,而不是sT,引起了对DNA损伤响应基因的适度上调。此外,LT、sT和LTtr的存在对转录相关机制产生影响。值得注意的是,sT仅对受其抑制的基因富集了与转录调控相关的GO术语。

    我们的研究结果与其他研究团队进行的先前研究相一致,表明T抗原导致了一组与先天免疫和炎症相关的基因的上调[13, 25]。我们发现,同时表达sT和LT或LTtr的细胞中,与趋化性、对肿瘤坏死因子(TNF)的响应和NfκB信号相关的基因明显增加。此外,我们的研究结果与先前的研究一致,表明sT影响与血管生成和细胞黏附有关的基因[26-28]。我们的研究进一步揭示了LT和LTtr在这些生物过程中的类似破坏作用。

    我们在类型I干扰素(IFN)信号通路基因和先天免疫信号通路中检测到最显著的转录变化。虽然LT/LTtr和sT+LT在早期时间点(3天后)增强并调节了这些过程,但它们在后续时间点与对照组细胞没有显著差异。然而,与LT/LTtr或sT+LT表达相比,单独表达sT反而产生了相反的效果,抑制了与类型I干扰素反应相关的基因,不论时间点如何。

  • MCPyV的sT稳定地抑制了类型I干扰素反应基因的转录。

    图3A中的火山图总结了与短T抗原(sT)在3天和8天后与空载体对照组相比在nHDFs中检测到的所有差异表达基因(DEGs)。值得注意的是,最显著上调的基因包括CCL7、CXCL6和IL1B,这些基因与炎症反应相关。对sT上调基因的基因本体(GO)分析(图S5)证实了先前的研究结果,即sT促进细胞分裂和增殖,干扰转录调控和蛋白质生物合成,以及趋化性[13, 26, 28]。有趣的是,与其他T抗原蛋白相比,sT的表达导致更多的基因表达下调(3天后449个,8天后795个),如图3A所示。对所有被sT显著下调的基因进行的GO分析显示出15个GO术语的富集,其中至少有10个基因,最小的FDR值为0.05,包括与负性转录调控和病毒基因组复制抑制相关的基因(图3B)。此外,细胞黏附、血管生成和发育等GO术语在较晚的时间点(8天后)富集。有趣的是,sT还导致与MHC-I类抗原呈递相关的基因的下调(图3B-C)。我们从参与先天免疫和类型I干扰素信号通路的GO术语中选择了一部分基因,并根据其功能进行分类(图3C)。我们发现,sT抑制了STING1、TRIM56、DDX60和TLR3等与细胞质DNA和RNA感知有关的基因。然而,根据它们对干扰素的响应,大多数基因被归类为干扰素刺激基因(ISGs)[29]。我们还发现了一组与MHC-I信号传导相关的基因簇,如HLA-A、HLA-B、HLA-E、TAP1和B2M(图3C)。

    当我们将经过逆转录聚合酶链反应(RT-qPCR)验证的sT表达的nHDFs中的PRR感知、ISGs和抗原呈递相关基因的特定基因位点与未处理的nHDFs进行比较时,发现sT有效地对抗了通过逆转录聚合酶链反应介导的暂时性抗病毒反应,表明lentiviral transduction (Fig. 3C, 左图)。sT有效地对抗了这种抗病毒反应信号,尤其是在8天后的时间点,通过抑制先天免疫基因的表达(Fig. 3C, 右图)。通过RT-qPCR验证了在sT表达的nHDFs中检测到的基因表达下调,包括PRR感知、ISGs和抗原呈递相关基因(suppl. Fig. S6A),并在MCC细胞系中得到了证实,这些细胞系中sT的表达可以通过诱导sT特异性shRNA的多西环素进行下调(suppl. Fig. S6B)。

  • MCPyV的sT特异性地抑制ISGF3依赖基因的转录。

    我们发现,在对lentiviral转导的反应中,sT稳定地抑制了与先天免疫有关的基因的转录,特别是ISG,它们在响应I/II型干扰素信号时被激活。确实,将这些基因与干扰素数据库[29]进行交叉引用后发现,25-35%的下调基因被分类为I/II型干扰素依赖基因(Fig. 4A)。

    为了确定sT调控的基因是否受干扰素刺激基因因子3(ISGF3)的调控,这是激活ISG所必需的关键转录因子复合物,我们将sT调控的ISGs与先前报道的ISG数据集进行了比较。我们选择了来自人类WT和IRF9-KO巨噬细胞的RNASeq数据集[30, 31]。重组IFNbeta的处理结果显示出四个主要的基因簇:IRF9独立上调和下调基因以及IRF9依赖的上调和下调基因。通过计算奇比(OR)和p值,我们将这些基因集与sT在3和8天后上调和下调的基因进行了比较(Fig. 4B)。我们观察到IRF9依赖和IFNbeta上调基因与sT下调基因的重叠,3天后的OR为15,8天后的OR为6.9(Fig. 4C)。总共,381个IRF9依赖和被IFNbeta处理上调的基因中,有65个与sT在3和8天后下调的基因重叠。相反,只有4个基因被sT上调(Fig. 4D和suppl. Tab. S5)。有趣的是,在由IFNbeta处理上调的IRF9独立基因中,有18个基因被sT下调,包括STAT1和IRF9,以及PARP9和DTX3L,这些基因据报道也参与了ISG的调控[32]。

在案例对照研究或横断面研究中,比值比(Odds Ratio,OR)是衡量暴露因素(自变量)与结果(因变量)之间关联程度的一种指标。比值比的计算公式如下: OR = (a/c) / (b/d) 其中: a:暴露组中患病案例的数量 b:未暴露组中患病案例的数量 c:暴露组中未患病案例的数量 d:未暴露组中未患病案例的数量 比值比可以使用统计软件或在线计算器进行计算。或者,您可以使用上述公式和研究中的相关数据手动计算比值比。以下是一个示例: 假设在一项研究中,我们想要研究吸烟与肺癌之间的关系。我们随机选择了500名肺癌患者和500名非肺癌患者,并观察他们的吸烟史。结果如下: 暴露组(吸烟者)中有200名肺癌患者(a) 未暴露组(非吸烟者)中有50名肺癌患者(b) 暴露组中有100名非肺癌患者(c) 未暴露组中有250名非肺癌患者(d) 根据上述数据,我们可以计算比值比如下: OR = (200/100) / (50/250) = 4 这表示吸烟与肺癌之间的关联程度为4,即吸烟者患肺癌的几率是非吸烟者的4倍。

  • MCPyV sT诱导的转录改变与激活性组蛋白修饰相关。

    基于先前发表的有关sT干扰转录过程的数据,例如sT通过与MYC-MAX-EP400转录因子复合物在特定基因启动子上的结合来触发转录的激活 [24],或者通过转激活赖氨酸特异性去甲基化酶LSD-1抑制复合物而导致转录抑制 [1, 33],我们进一步研究了组蛋白修饰在sT依赖的差异表达基因的启动子调控中的作用。我们在表达sT的nHDF细胞中使用针对H3K4me3和H3K27ac的抗体进行ChIP-Seq实验,这两种抗体在活性基因启动子中均存在。此外,我们使用针对H3K27me3和H3K9me3的抗体来识别顺式异染色质和构成性异染色质的变化。大部分H3K4me3的变化发生在启动子区域,而差异化的H3K27ac、H3K27me3和H3K9me3信号则在基因体内或基因间区域内检测到。我们只观察到少数几个具有抑制性组蛋白修饰标记的变化,log2FC <-1; >1相反地,我们检测到大量具有激活性组蛋白修饰(H3K4me3和H3K27ac)变化的基因(图5A,附表S6)。有趣的是,仅有19个注释基因检测到H3K4me3的增加(log2FC >1),而678个基因检测到H3K4me3的减少。同样,H3K27ac增加了1707个基因的修饰,而减少了6853个基因的修饰(图5A)。

    对比转录状态和每种分析的组蛋白修饰的富集度,发现H3K4me3、H3K27ac和H3K27me3之间存在较强的相关性,但H3K9me3没有(图5B)。为了测试GO术语中的基因是否与组蛋白修饰信号的变化相关,我们量化了所有贡献于特定GO术语的基因的转录起始位点(TSSs)(<= 10 kb)周围的组蛋白修饰信号,这些基因列在附表S6中。我们发现,sT下调基因与天然免疫过程、转录调控和细胞粘附等GO术语的减少水平的H3K4me3和H3K27ac水平相关。然而,在这些特定基因集的启动子附近或位置上,我们没有检测到任何抑制性组蛋白修饰标记。相反,sT上调基因和富集在细胞周期与分裂以及蛋白质生物合成等GO术语中的基因在激活性组蛋白修饰中没有显示出显著增加。这些观察结果也适用于一个随机选择的在sT表达下没有显著变化的基因子集(图5C)。对于nHDF的供体II获得的数据进行分析时,得到了类似的结果(附图S7)。

  • MCPyV sT通过特异性靶向ISGF3来抑制ISG转录,而不干扰ISGF3的功能。

    转录组分析显示,MCPyV sT抑制多种ISG的转录,这些ISG的激活是由包括pSTAT1、pSTAT2和IRF9的ISGF3转录因子复合物介导的。在定位到细胞核后,ISGF3结合ISG启动子中的ISRE并激活其转录。另外,pSTAT1可以二聚化并导致一些ISG启动子中GAS元件的激活。基于sT下调的ISGF3依赖基因的强烈信号,我们假设sT可能通过特异性靶向ISGF3来抑制ISG转录。我们通过RT-qPCR验证了ISGF3成员IRF9和STAT1的mRNA水平以及OAS2和IRF7作为ISG下游表达的代表性基因的显著降低(图6A)。此外,免疫荧光研究显示,在表达sT的细胞中,整体IRF9信号减弱,并且与向量对照相比,表达sT的细胞中3天后核内的IRF9信号显著减弱(图6B)。然而,当我们对转染sT并在IFNbeta处理16小时的H1299细胞进行细胞分离时,我们并未发现核内或细胞质中IRF9水平的改变(图S8,lane 10-12和4-6),这表明sT并不干扰IRF9的核转位。当我们检测STAT1的磷酸化时,在2天后与向量对照相比,sT表达的nHDFs中仅有轻微的降低。此外,IRF9的蛋白浓度在2天后也稍有降低。然而,在8天后观察到IRF9蛋白水平的显著降低(图6C)。这些结果强烈表明,sT通过在转录水平上影响ISGF3成员的表达,进而在后续时间点降低总蛋白水平(图6C)。

  • MCPyV sT通过干扰TBK1水平及上游来对抗I型干扰素(IFN)信号。

    MCPyV sT在细胞核中具有许多功能,如转录调控,例如通过招募转录因子和染色质重构复合物EP400/MYC/MAX。然而,它也具有胞质功能,包括与适配蛋白NF-κB必需调节因子(NEMO)蛋白的结合,导致炎症基因的表达抑制。为了澄清sT在启动子水平上的转录调控是否直接导致干扰素刺激基因(ISG)的转录抑制,还是由于IFNARI/II的上游,我们在H1299细胞中进行了基于荧光素酶的启动子实验,使用人类IFNbeta启动子和三个代表性的ISG启动子:人类IFIT1和IFIT2启动子以及小鼠MX1启动子。在这里,我们调查了当IFNARI/II通路被刺激时,sT是否可以对抗ISG启动子的活性。与已知的IFNAR信号拮抗剂小鼠巨细胞病毒CMV-M27蛋白不同[12],过表达未标记或带V5标记的MCPyV sT并未显著降低H1299细胞中ISG启动子的活性(图7A-C)。

    接下来,我们评估了人类IFNbeta启动子,该启动子包含多个转录因子的结合位点,包括NF-κB、IRF3或IRF7。当过表达一个构成活化的IRF3形式(IRF3-5D)[34]时,在表达sT的H1299细胞中,并未观察到IFNbeta启动子活性的显著降低,这表明sT并不干扰IRF3驱动的IFNbeta启动子活化(图7D)。有趣的是,当我们过表达TBK1,这是IRF3上游的关键信号因子时,我们发现IFNbeta启动子的激活显著降低,未标记的sT与带有V5标记的sT相比,降低更为明显(图7E),这表明sT干扰IFNbeta信号在TBK1上游或TBK1水平。同样地,当表达已知在TBK1水平调节I型IFN响应的人类巨细胞病毒UL35时,在共表达TBK1时,IFNbeta启动子活性显著降低(图7E),但在表达IRF3-5D时没有影响(图7D)。如预期,在两个实验中,作为已知IRF3和RIG-I拮抗剂的流感A病毒(IAV)NS1蛋白都对IFNbeta启动子产生显著影响(图7D-E)。

    TBK1是活化的胞质模式识别受体(PRR)刺激下游的关键激酶,例如通过cGAS-STING介导的DNA感知和RIG-I-MAVS介导的RNA感知。为了研究MCPyV sT表达对TBK1下游信号事件的影响,我们检查了cGAS-STING和RIG-I信号传导,这两个信号都汇合于激活TBK1。我们在HEK-293细胞中过表达cGAS-STING或RIG-I,并使用IFNbeta启动子进行荧光素酶活性测定。共表达sT导致DNA(图7F)和RNA(图7G)激活信号的荧光素酶活性显著降低。类似的效应也观察到了IAV NS1蛋白,验证了它已知的拮抗IRF3和RIG-I的能力[36, 37]。此外,在HEK293细胞中过表达TBK1也证实了MCPyV sT显著降低IFNbeta启动子的活性(图7H),从而表明sT对不同细胞系中的IFNbeta启动子具有一致的调控。

    为了研究BKPyV sT对IFNbeta和ISG启动子活性的影响,已报道其干扰RIG-I信号[38],我们检查了过表达带V5标记的BKPyV sT在H1299细胞中的影响。细胞分离显示,BKPyV与MCPyV sT一样,在转染后同时在胞质和细胞核分离物中表达(图S9)。同样,在表达V5标记的BKPyV sT时,我们没有观察到IFIT1、IFIT2或MX1启动子活性的显著降低。然而,在共表达TBK1时,我们观察到IFNbeta启动子活性显著降低,而在表达IRF3-5D时没有影响(图S10)。这些发现表明,BKPyV和MCPyV sT可能在干扰胞质DNA和/或RNA感知引发的I型IFN信号中具有保守的作用。

    总的来说,我们的发现表明,MCPyV sT通过干扰TBK1的上游或TBK1水平的信号事件来影响I型干扰素(IFN)信号传导,特别是在胞质DNA或RNA感知的情况下。这种干扰最终导致IFNbeta的产生减少,并进而抑制ISG的产生。

  • MCPyV sT抵消了LT和LTtr对ISG转录的刺激作用。

    观察到MCPyV sT干扰宿主抗病毒免疫反应,促使我们研究其在感染和发病过程中的功能后果。在感染过程中,sT和LT均表达并且对病毒基因组复制是必需的。此外,在肿瘤细胞中,sT和LTtr的表达对于细胞增殖至关重要。考虑到我们实验设置中的这两种情况,我们专注于一组基因,这些基因在与先天免疫应答、对病毒的应答和对病毒的防御等基因本体学术语中富集(图2)。我们发现这些基因展示了相反的调控模式。通过比较它们的log2FC,我们证实LT、LTtr以及sT+LT的共表达在3天后的转录过程中导致了大多数基因的显著上调。相反,仅表达sT导致这些基因的下调。此外,在8-12天后,我们观察到仅sT的表达导致了这些基因的抑制(图8)。在所示的基因本体学术语中(suppl. Tab. S4)的大部分基因被分类为干扰素刺激基因(ISGs)[39]。值得注意的是,当比较IFNalpha和IFNbeta基因的调控时,我们发现LT和LTtr在3天后特异性上调了IFNbeta的表达(图8)。为了验证所选ISG(IFI6、IRF9、ISG15、OAS2和STAT1)的相反调控,我们对RNASeq中分析的两个不同nHDF供体进行了RT-qPCR(suppl. Fig. S11A)。有趣的是,我们观察到供体I中T抗原表达效应的稍微延迟与供体II相比。然而,在PCA中,包括图8中显示的ISG子集,我们没有检测到单独的供体簇(suppl. Fig. S11B)。

    总而言之,我们发现LT和LTtr在原代成纤维细胞中增强了抗病毒反应,而sT通过抑制IRF3-TBK1信号传导特异性地抑制ISG的表达。我们认为通过这种机制,sT可以在感染情况下抵消LT引起的短暂I型干扰素反应,并在MCC的情况下抵消LTtr引起的干扰素反应。

讨论

Merkel细胞多瘤病毒具有有限的编码能力,包括五种病毒蛋白和一种病毒miRNA。其中,早期蛋白sT和LT在感染和发病过程中起着关键作用,具有多功能性。在感染过程中,LT在病毒转录和病毒DNA复制中起着重要作用[2, 4, 40, 41],sT在其中发挥支持作用,尽管其确切作用尚未完全理解[2, 41]。在发病方面,目前的数据表明,sT主要推动细胞转化和疾病进展。然而,肿瘤细胞的生存能力取决于sT和肿瘤特异性的LTtr[42, 43]。

sT和LT在感染过程中以紧密调节的平衡状态共同工作,促进病毒复制。然而,在发病过程中,这种相互作用发生了显著变化,部分原因是由于LT的C-末端选择压力和LT蛋白的截短版本的表达[1, 24, 44]。为了解决先前感染模型的局限性,研究主要通过在BJ-5ta细胞、hTERT永生化成纤维细胞中进行过表达实验来研究早期病毒蛋白对宿主反应的影响[13, 45]。这些研究仅考虑了早期病毒蛋白表达的特定组合,可能不能完全代表实际的感染或转化条件。类似地,对于sT的转化特性的研究主要集中在理解sT单独调控转录激活和抑制的机制,而没有考虑LT或LTtr表达的背景[45, 46]。 在这项研究中,我们对MCPyV-T抗原存在时发生的转录和表观遗传变化进行了彻底的分析。我们的研究重点放在原代皮肤成纤维细胞上,这是一种与感染和转化相关的细胞类型[16]。通过研究仅早期抗原的效应以及它们的组合效应,我们对病毒与宿主免疫反应之间复杂的相互作用有了新的认识。sT在多个层面上操纵宿主特异性免疫反应,例如作为强效的类型I干扰素(IFN)信号的拮抗剂。在我们的实验设置中,我们观察到sT对PRR感应分子(包括TLR3、STING、TRIM56和DDX60)的基因表达进行了负调节。此外,我们还发现sT降低了ISG的表达水平,并减少了抗原呈递和抗原加工因子的水平,包括多种HLA分子和TAP1。这些发现进一步阐明了病毒与免疫系统之间的复杂相互作用,这对于建立感染以及病毒的发病过程都是重要的。

我们的研究结果确认并扩展了Krump等人最近发表的数据的范围和机制,他们在原代HDF细胞中应用了复杂而具有挑战性的MCPyV感染模型,导致细胞亚群的感染,但未普遍感染(参见补充图S1)[25, 41]。通过使用RT-qPCR,Krump等人发现,在MCPyV感染后六天,关键的抗病毒ISG和炎症细胞因子表达水平增加,STING-TBK1-IRF3成为MCPyV感染触发的重要轴。值得注意的是,在感染的早期阶段,这些基因的表达水平没有明显上调。在我们的研究中,我们提供了证据表明,sT功能是阻断诱导ISG的信号级联。具体而言,我们发现这种调控发生在细胞质中,在TBK1的水平上。我们的结果为类型I干扰素反应的触发时间延迟和激活机制提供了可能的解释。在感染的早期阶段,当sT和LT的表达水平保持平衡时,sT有效地抑制了LT引发的PRR信号传导。sT与病毒因子之间的这种动态平衡使得sT能够有效地削弱干扰素反应。然而,随着感染的进行和LT诱导病毒DNA复制,sT与其他病毒因子(如LT和新产生的病毒基因组)之间的平衡发生变化。在这个阶段,平衡转向可能触发PRR信号的因子,使得sT单独无法有效地抑制干扰素反应。事实上,已经证明BKPyV和JCPyV感染在不同细胞类型中可能引发不同的ISG反应[7, 47, 48]。例如,在某些细胞类型(如小鼠成纤维细胞或内皮血管细胞)中,BKPyV和JCPyV或BKPyV感染分别诱导ISG的表达[48]。然而,在原代RPTE细胞中,BKPyV无法诱导ISG的表达,而JCPyV可以[7, 47]。这表明这些多瘤病毒引发ISG反应的能力可能与细胞类型相关。

我们对sT调控的ISG签名进行了更详细的分析,并与IRF9敲除实验的数据集进行了比较[49]令人惊讶的是,除了ISGF3诱导的ISG之间存在高度一致性外,我们还确定了一个较小的IRF9独立调控的ISG亚组(Fig. 4,suppl. Tab. S5)。值得注意的是,在这些IRF9独立的ISG中,我们确定了PARPL和DDX3L。这两个因子已知形成一个复合物,在病毒感染引发的STAT1介导的类型I干扰素反应中作为伴侣能增强干扰素的反应。有待进一步研究来调查该复合物如何影响MCPyV引发的干扰素反应以及所涉及的分子相互作用。

在我们对sT调控的基因的转录起始位点的组蛋白修饰研究中,我们观察到sT通过失去激活标记来负调控基因表达。具体来说,我们发现H3K27乙酰化(H3K27ac)减少以及H3K4三甲基化(H3K4me3)水平降低。这些观察结果表明,sT可能参与调控开放和可及性较高的启动子,而不是在这些时间点诱导转录抑制性组蛋白修饰的招募。

与先前关于Merkel细胞癌(MCC)细胞中HLA分子下调的发现一致,我们的研究证实sT在调节各种细胞表面分子的呈现方面起到一定作用,影响转移和免疫识别[19-22]。特别是,我们能够重新确认sT在MCC肿瘤细胞上调控各种HLA分子的表达,从而影响其免疫识别,正如我们最近所展示的[22]。其他研究也显示了MCC肿瘤中HLA分子的下调[19-21]。虽然我们的工作证明了这种调控是发生在转录水平,并与激活性组蛋白修饰的丧失相关,但李和他的同事显示在HLA启动子上有一个依赖PRC1.1的H3K27me3修饰的Polycomb复合物的沉积,该修饰可以通过抑制Usp7而逆转[21]。我们观察结果与在MCC细胞系中获得的结果之间的差异可能归因于不同的细胞类型,更重要的是,MCC细胞系是长期变化的反映,其中一些变化是由其他机制调控的。

与MCC细胞系中HLA表达的失调类似,STING也被报道在MCC细胞系和MCC组织中受到抑制,类似于其他癌症类型[6, 18, 25]。我们在这里提供的证据表明,MCC和MCPyV感染中观察到的STING调控失调是sT诱导的IFN依赖基因包括STING的抑制的结果。相应地,当我们在MCC细胞系中下调sT的表达时,我们观察到STING的表达增加。sT调控STING表达的能力表明MCPyV逃避宿主免系统的能力,从而促进MCC的进展。因此,通过靶向sT或其相关的通路,以抵制病毒阳性MCC的进展,可能具有潜在的治疗意义。

虽然已经证明SV40、BKPyV和JCPyV的LT表达能在不同细胞系中诱导ISG的表达[51, 52],但我们是首次展示MCPyV LT表达具有类似的效应。值得注意的是,在先前的研究中,这种ISG反应被特别归因于LT的表达,而不是sT,这是通过cDNA过表达实验得出的结果[48]。虽然增加的ISG表达被归因于SV40 LT诱导的DNA损伤[48, 51],但我们还不知道MCPyV LT诱导的ISG反应的确切机制。因此,进一步探索这一点将是至关重要的,特别是因为MCPyV LTtr也能够诱导ISG反应,尽管与LT不同,LTtr不会增加基因毒性应激[1, 5, 24, 44]。

我们还首次展示了LT或LTtr诱导的一型干扰素反应基因表达受到sT的平衡作用,使sT成为一个关键的免疫逃逸因子。如果有一个全程感染的模型,未来的研究将有必要阐明病毒PAMPs以及sT在感染周期中的全面功能。有趣的是,最近的研究表明JCPyV和BKPyV的sT,但不包括SV40和MCPyV的sT,抑制TRIM25诱导的RIG-I激活以扰乱先天免疫反应,这表明病毒RNA中间体可能是潜在的PAMPs[38]。在我们的实验设置中,慢病毒转导诱导了一型干扰素反应,而LT/LTtr进一步增加了该反应,而sT则扰乱了该反应。我们的IFNbeta和ISG启动子的荧光酶分析以及sT的过表达结果证实了这一点,表明MCPyV和BKPyV的sT能够拮抗TBK1介导的信号传导,暗示着细胞质DNA和RNA的PAMPs识别。

我们的研究的局限性包括由于缺乏感染模型(suppl. Fig. S1),我们进行了过表达实验。然而,我们进行了慢病毒转导,从而避免了非生理性高水平的病毒蛋白表达和随后的细胞反应改变。还可能存在由于供体间的差异而产生的其他限制。这些变异,特别是在免疫

反应方面的差异可能会影响实验结果。因此,我们选取了两个独立供体作为nHDF细胞的来源,在免疫反应方面,我们只在供体I和II之间检测到微小差异。我们在转录组分析中也没有包括其他细胞类型,这些细胞类型可能对病毒蛋白呈现不同的反应,因此这些结果可能不适用于不同的细胞类型或组织。然而,我们已经确认了荧光酶分析的结果,以阐明sT在H1299和293细胞中参与一型干扰素反应的功能,从而证明了sT在多个细胞系中的功能。

这项研究首次显示sT是MCPyV中拮抗一型干扰素信号传导的介质。此外,sT通过IRF3-TBK1通路有效调节核酸诱导的识别途径,并抑制了IFNbeta和ISG的产生。我们通过BKPyV sT的确认结果表明,PyV的sT抗原在微调宿主对病毒复制产物的先天免疫反应方面具有保守的作用。

我们的研究结果表明,MCPyV sT抵消了LT或在肿瘤中由LTtr引起的宿主反应,使sT成为病毒通过免疫逃逸在体内持续存在的关键因素,同时也在肿瘤细胞的免疫逃逸中起到重要作用。这些发现改进了我们对PyV,尤其是MCPyV如何绕过先天免疫反应的理解,特别是在MCC中,并为MCC的治疗提供了潜在的靶点。

材料和方法

  • 质粒和试剂

    第三代慢病毒质粒LeGO-MCPyV sT-iC2 (sT-IRES-mCherry)的详细描述见我们之前的研究 [22]。 MCPyV LT或MCC350衍生的LTtr(MCC350)的cDNA通过限制性酶切与LeGO-iG2(eGFP)进行克隆。 MCPyV sT-co(优化密码子)[41]或BKPyV sT的cDNA通过Gibson装配克隆(ThermoFisher)插入到pcDNA3.1中。

    Luc质粒:…

  • 细胞培养

    从两个不同人类供体购买的新生儿人皮肤成纤维细胞(nHDF)来自Lonza(#CC-2509)。HEK293细胞(ATCC),LentiX-293T细胞(ATCC),H1299细胞(ATCC)和新生儿人皮肤成纤维细胞在标准培养条件下以Dulbecco’s Modified Eagle Medium(DMEM;Gibco)补充10%胎牛血清(FBS;Gibco)和1%青霉素链霉素(Gibco)培养。

    患者来源的MCC细胞系WaGa在Roswell Park Memorial Institute(RPMI;Gibco)培养基中培养。H1299细胞通过聚乙烯亚胺(PEI)转染:将10 µg DNA与不含补充物的DMEM以及PEI按1:10的比例混合,加入到10 cm培养皿中的单层细胞中。对于荧光素酶基于启动子的检测,使用Fugene HD(Promega)按照制造商的说明书,在DNA和Fugene HD的3:1比例下转染H1299或HEK293细胞。慢病毒颗粒的制备是在Lenti-X 293T细胞中进行的,方法如先前描述的那样[23]。 nHDFs的感染方法与先前描述的方法相同[16]。在WaGa细胞中使用shRNA介导的sT敲低方法与最近发表的方法相同[22]。

  • nHDFs的慢病毒转导

    将nHDFs种植在6孔板中,并以MOI为1的慢病毒颗粒转导。为了增加病毒摄取,添加聚苯乙烯磺胺(Polybrene,Sigma)(最终浓度为8 µg/ml),并在37°C下以1000 xg离心30分钟。转导后的两天,对nHDFs进行荧光激活细胞分选(FACS),并继续培养。

  • 基于荧光素酶的启动子测定

    将H1299或HEK293细胞种植在96孔板中,并共转染10 ng pEF1α-RL,100 ng荧光素酶基于pIFNβ-FL或pISG-FL报告质粒,以及100 ng MCPyV sT,BKPyV sT,流感A病毒(IAV)NS1,人巨细胞病毒(CMV)UL35或小鼠CMV M27。对于每个实验,不同的刺激物要么共转染,要么在稍后的时间点添加,如下所述:

    • ISG启动子测定:共转染pGL3basic-hIFIT1-Luc,pGL3basic-hIFIT2-Luc或pGL3basic-mMX1-Luc,并在转染后24小时加入人重组IFNβ(1 ng/ml;PeproTech,#300-02BC)进行16小时,然后在1×被动裂解缓冲(PLB;Promega)中裂解细胞。
    • IFNβ启动子测定:将pGL3basic-IFNβ-Luc与100 ng TBK1或IRF3-5D共转染。另外,转染混合物中还包括100 ng RIG-I-N或50 ng cGAS和50 ng STING。转染后20小时,细胞在1× PLB中裂解。

    使用双荧光素酶报告试剂盒系统(Promega)按照制造商的说明测量荧光素酶活性。在Tecan Infinite® 200 Pro微孔读数器(Tecan,Männedorf,瑞士)中测量荧光素酶信号。萤火虫荧光素酶信号标准化到Renilla荧光素酶信号。

  • 西方印迹分析

    使用RIPA缓冲液(50 mM Tris-HCl,pH 7.5,150 mM NaCl,1 mM EGTA,1%NP-40,0.5%Na-脱氧胆酸,0.1%SDS,1 mM NaF,2 mM β-甘油磷酸,1 mM Na3VO4,0.4 mM PMSF,cOmplete蛋白酶抑制剂混合物(Roche))进行细胞裂解,如先前描述的方法[23]。或者,将细胞在1× PLB中裂解20分钟,然后在4°C下以11,000 xg离心10分钟。使用核和细胞质提取试剂盒(ThermoFisher)进行细胞亚细胞分离。每个SDS凝胶中加载30-50 µg蛋白质,并使用以下一抗体进行Western印迹:鼠抗体,α-LT Cm2B4(1:1000;Santa Cruz),α-sT 2T2 [53]和α-actin C4(1:10,000;Chemicon)。此外,使用兔抗体,α-pSTAT1(1:1000;Cell Signaling),α-STAT1(1:000;Cell Signaling),α-IRF9(1:1000;Cell Signaling),α-V5(1:1000;ThermoFisher),α-Lamin B1(1:2000;abcam,由Stefan Linder提供),以及α-HSP90(1:1000;abcam)。次级抗体使用如下:辣根过氧化物酶(HRP)结合的鼠IgG(1:10,000;GE Healthcare)或兔IgG(1:3000;Cell Signaling)抗体。

  • 免疫荧光染色和共聚焦显微镜

    对于免疫荧光分析,nHDFs生长在明胶涂层的玻璃片上。使用4%PFA固定细胞,并使用以下一抗体进行免疫荧光染色:鼠抗体α-LT Cm2B4(1:1000;Santa Cruz),兔抗体α-VP1(1:500;Christopher B. Buck,NCI)和α-IRF9(1:1000;Cell Signaling)。作为二级抗体,使用Alexa-Fluor标记的α兔IgG-647,α兔IgG-560或α鼠IgG-488,浓度为1:500。使用共聚焦激光扫描显微镜(DMI 6000,Leica TCS SP5双层扫描器,Leica,Wetzlar,德国)获取图像,并使用油浸40HCX或63HCX PL CS plan-apo目镜。图像分析使用Volocity Demo Version 6.1.1(Perkin Elmer,Waltham,MA)进行。使用QuPath软件[54]对IRF9核信号进行定量分析。从每个条件中分析50-150个细胞,至少分析三个玻璃片。

  • 定量实时聚合酶链反应(qPCR)

    用RNeasy提取试剂盒(Qiagen)和DNA- free™ DNA removal kit(Invitrogen)从细胞中提取总RNA。对于逆转录(RT)-qPCR,使用Superscript IV(Invitrogen)按照制造商的说明将1.7 µg RNA转录为cDNA。使用SYBR Green(Thermo Fisher)进行定量实时聚合酶链反应。

    • HPRT:5′-TGACCTTGATTTATTTTGCATACC-3’(前),5′-CTCGAGCAAGACGTTCAGTC-3’(后),
    • TBP:…

    此外,使用以下探针进行TaqMan实验(Thermo Fisher):MCPyV VP1:VIC-GTGGCGCTGAGTACGTCGTGGAGTC,人类GAPDH:6FAM-GATCTGGAGATGATCCCTTTGGCTG。

    反应在Rotor-Gene Q-plex(Qiagen)中进行。使用Rotor-Gene Q Series软件(版本2.3.1)对数据进行分析,并以至少两个参考基因(包括TBP、HPRT和GAPDH)进行归一化。通过qPCR分析感染nHDFs培养液中的MCPyV基因组拷贝数,以VP1与GAPDH归一化。本研究使用的引物和探针列在补充表格S7中。

  • RNA测序

    使用Bioanalyzer和RNA Nano试剂盒(Agilent)检测分离的RNA。对通过RNA完整性评分(RIN)为0.7的所有样本,进行1µg RNA的RNA-Seq文库制备。使用NEBNext Poly(A)磁性分离模块(NEB)进行mRNA富集。根据制造商的方案,使用NEXTFlex快速定向qRNA-Seq试剂盒(Bioo Scientific)进行RNA文库制备。采用Illumina HiSeq 2500平台(SR50)进行测序。使用STAR 2.6.0c工具将测序读数映射到人类基因组(hg19)。采用DESeq2进行差异基因表达(DEG)分析,如前所述。使用DAVID工具进行基因本体(GO)分析,使用padj <0.05和log2FC >1/<-1的DEGs。选择与生物过程相关的GO术语,其FDR或p值<0.05,并且在GO术语中至少包含五个基因。GO术语分析的所有未经筛选的表格均显示在补充表S3中。使用R Studio生成火山图、热图和气泡图。使用Python创建PCA图。

    IRF9 ko比较

  • 染色质免疫沉淀测序

    进行原代染色质免疫沉淀(nChIP),根据[58]的描述,每个IP使用20万个nHDFs作为起始材料。以下抗体用于nChIP反应:兔mAb α-H3K4me3(1:200;Merck Millipore)、α-H3K27me3(1:300;Cell Signaling)、α-H3K9me3(1:200;Active Motif)、α-H3K27ac(1:200;abcam)和IgG(1:200;Merck Millipore)。根据制造商的说明,使用NEXTFlexTM ChIP-Seq文库制备试剂盒(Bioo Scientific)制备提取的DNA的文库。样品在Illumina HiSeq2500上进行测序,使用BWA工具[59]将读数映射到hg19。使用MACS2 [60]进行H3K4me3和H3K27ac峰值调用,使用SICER [61]调用更广泛的峰值,例如H3K9me3和H3K27me3。根据[62]的描述进行diffReps分析。分析和可视化使用R Studio和EaSeq [63]进行。

  • 统计分析

    使用GraphPad Prism(版本9.0,GraphPad Software,San Diego,CA)进行统计分析。

Draw Venn Diagrams using matplotlib

Venn_Diagram_NHDF_vs_HEK293_vs_PFSK-1

NHDF-LT_vs_NHDF-LT-K331A

Venn_Diagram_Peak_Consistency_Between_NHDF-LT-K331A_Donors

Venn_Diagram_Peak_Consistency_Between_NHDF-LT_Donors

Venn_Diagram_Peak_Consistency_Between_PFSK-1-LT+sT_Replicates

Venn_Diagram_Peak_Consistency_Between_HEK293-LT+sT_Replicates

#peaks_PFSK-1.txt   peaks_HEK293.txt    peaks_NHDF.txt  Total   Name
#       X   950 peaks_NHDF.txt
#   X       964 peaks_HEK293.txt
#   X   X   68  peaks_HEK293.txt|peaks_NHDF.txt
#X          6155    peaks_PFSK-1.txt
#X      X   653 peaks_PFSK-1.txt|peaks_NHDF.txt
#X  X       920 peaks_PFSK-1.txt|peaks_HEK293.txt
#X  X   X   267 peaks_PFSK-1.txt|peaks_HEK293.txt|peaks_NHDF.txt
#-------
import matplotlib.pyplot as plt
from matplotlib_venn import venn3

# Define the sizes of the sets
set1_only = 950  # Size of NHDF 
set2_only = 964   # Size of HEK293
set3_only = 6155   # Size of PFSK-1

# Define the sizes of the overlapping regions
shared_elements_12 = 68   # Size of the overlap between set 1 and set 2
shared_elements_13= 653   # Size of the overlap between set 1 and set 3
shared_elements_23 = 920   # Size of the overlap between set 2 and set 3
shared_elements_123 = 267  # Size of the overlap between all three sets

## Define the sizes of the datasets
#set1_size = 100
#set2_size = 80
#set3_size = 60
#shared_elements_12 = 30
#shared_elements_13 = 20
#shared_elements_23 = 15
#shared_elements_123 = 10
## Calculate the sizes of the individual and overlapping sets
#set1_only = set1_size - shared_elements_12 - shared_elements_13 - shared_elements_123
#set2_only = set2_size - shared_elements_12 - shared_elements_23 - shared_elements_123
#set3_only = set3_size - shared_elements_13 - shared_elements_23 - shared_elements_123

# Create the Venn diagram
venn3(subsets=(set1_only, set2_only, shared_elements_12, set3_only, shared_elements_13, shared_elements_23, shared_elements_123), set_labels=('NHDF LT', 'HEK293 LT+sT', 'PFSK-1 LT+sT'))
#venn3(subsets=(set1_size, set2_size, set1_set2_size, set3_size, set1_set3_size, set2_set3_size, set1_set2_set3_size), set_labels=('NHDF', 'HEK293', 'PFSK-1'))

# Set the title and labels
plt.title('', fontsize=16)  #Venn Diagram of Three Datasets
plt.xlabel('Samples')
plt.ylabel('Count')

# Show the Venn diagram
#plt.show()
plt.savefig('Venn_Diagram_NHDF_vs_HEK293_vs_PFSK-1.png', dpi=300, bbox_inches='tight') 

#peaks_K331A.txt    peaks_NHDF.txt  Total   Name
#   X   1818    peaks_NHDF.txt
#X      261 peaks_K331A.txt
#X  X   120 peaks_K331A.txt|peaks_NHDF.txt
import matplotlib.pyplot as plt
from matplotlib_venn import venn2

# Define the sizes of the datasets
set1_size = 1938
set2_size = 381
shared_elements = 120

# Calculate the sizes of the individual and overlapping sets
set1_only = set1_size - shared_elements
set2_only = set2_size - shared_elements

# Create the Venn diagram
venn2(subsets=(set1_only, set2_only, shared_elements), set_labels=('NHDF LT', 'NHDF LT K331A'))

# Set the title and labels
plt.title('')  #NHDF LT vs NHDF K331A
plt.xlabel('Number of Elements')
plt.ylabel('')

# Display the Venn diagram
#plt.show()
plt.savefig('Venn_Diagram_NHDF-LT_vs_NHDF-LT-K331A.png', dpi=300, bbox_inches='tight') 

## under directory "results_ChIPseq_NHDF_LT_K331A_PFSK-1_HEK293_hg38/homer"
#grep "NHDF_K331A_DonorI/peaks.txt" peaks_K331A_LT.txt | wc -l  # 224-186=38
#grep "NHDF_K331A_DonorII/peaks.txt" peaks_K331A_LT.txt | wc -l # 343-186=157
#grep "NHDF_K331A_DonorI/peaks.txt|NHDF_K331A_DonorII/peaks.txt" peaks_K331A_LT.txt | wc -l #186
##38+157+186=381
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
venn2(subsets=(38, 157, 186), set_labels=('Donor I', 'Donor II'))
plt.title('Peak Consistency Between NHDF LT K331A Donors (Total 381 peaks)') 
plt.xlabel('Number of Elements')
plt.ylabel('')
plt.savefig('Venn_Diagram_Peak_Consistency_Between_NHDF-LT-K331A_Donors.png', dpi=300, bbox_inches='tight') 

#grep "NHDF_LT_DonorI/peaks.txt" peaks_NHDF_LT.txt | wc -l  # 1440-969=471
#grep "NHDF_LT_DonorII/peaks.txt" peaks_NHDF_LT.txt | wc -l # 1467-969=498
#grep "NHDF_LT_DonorI/peaks.txt|NHDF_LT_DonorII/peaks.txt" peaks_NHDF_LT.txt | wc -l #969
##471+498+969=1938
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
venn2(subsets=(471, 498, 969), set_labels=('Donor I', 'Donor II'))
plt.title('Peak Consistency Between NHDF LT Donors (Total 1938 peaks)')
plt.xlabel('Number of Elements')
plt.ylabel('')
plt.savefig('Venn_Diagram_Peak_Consistency_Between_NHDF-LT_Donors.png', dpi=300, bbox_inches='tight') 

#grep "PFSK-1B_LT+sT_r1/peaks.txt" peaks_PFSK-1_LT+sT.txt | wc -l  # 4814-2927=1887
#grep "PFSK-1B_LT+sT_r2/peaks.txt" peaks_PFSK-1_LT+sT.txt | wc -l  # 6108-2927=3181
#grep "PFSK-1B_LT+sT_r1/peaks.txt|PFSK-1B_LT+sT_r2/peaks.txt" peaks_PFSK-1_LT+sT.txt | wc -l #2927
##1887+3181+2927=7995
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
venn2(subsets=(1887, 3181, 2927), set_labels=('Replicate 1', 'Replicate 2'))
plt.title('Peak Consistency Between PFSK-1 LT+sT Replicates (Total 7995 peaks)')
plt.xlabel('Number of Elements')
plt.ylabel('')
plt.savefig('Venn_Diagram_Peak_Consistency_Between_PFSK-1-LT+sT_Replicates.png', dpi=300, bbox_inches='tight') 

#grep "HEK293_LT+sT_r2/peaks.txt" peaks_HEK293_LT+sT.txt | wc -l  # 1896-580=1316
#grep "HEK293_LT+sT_r3/peaks.txt" peaks_HEK293_LT+sT.txt | wc -l  # 903-580=323
#grep "HEK293_LT+sT_r2/peaks.txt|HEK293_LT+sT_r3/peaks.txt" peaks_HEK293_LT+sT.txt | wc -l #580
##1316+323+580=2219
import matplotlib.pyplot as plt
from matplotlib_venn import venn2
venn2(subsets=(1316, 323, 580), set_labels=('Replicate 1', 'Replicate 2'))
plt.title('Peak Consistency Between HEK293 LT+sT Replicates (Total 2219 peaks)')
plt.xlabel('Number of Elements')
plt.ylabel('')
plt.savefig('Venn_Diagram_Peak_Consistency_Between_HEK293-LT+sT_Replicates.png', dpi=300, bbox_inches='tight') 

# Draw Venn Diagram using Set rather than Set Size
##38+186+157=381
#import matplotlib.pyplot as plt
#from matplotlib_venn import venn2
#
#set1 = set(range(1, 225))  
#set2 = set(range(39, 382))
#
#venn2([set1, set2], set_labels=('Replicate 1', 'Replicate 2'), set_colors=('skyblue', 'lightgreen'))
#plt.title('Peak Consistency Between NHDF K331A Replicates (total 381 peaks)', fontsize=16)
#plt.xticks(fontsize=12)
#plt.yticks(fontsize=12)
#
##plt.show()  # Display the Venn diagram
#plt.savefig('Venn_Diagram_Peak_Consistency_Between_NHDF_K331A_Replicates.png')  # Save the Venn diagram as an image file