Author Archives: gene_x

Install and configure Docker for ‘nextflow run nf-core/rnaseq’

Install and configure the Docker packages (docker-ce, docker-ce-cli, and containerd.io):

  1. Update your system’s package information:

    sudo apt-get update
  2. Uninstall any old versions of Docker:

    sudo apt-get remove docker docker-engine docker.io containerd runc
  3. Set up the Docker repository to get the latest version of Docker. First, install packages to allow apt to use a repository over HTTPS:

    sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release
  4. Add Docker’s official GPG key:

    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
  5. Use the following command to set up the stable repository (Note the use of the signed-by option to point to the keyring file that you have just downloaded):

    echo \
      "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
      $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
  6. Update the apt package index again, and then install Docker:

    sudo apt-get update
    sudo apt-get install docker-ce docker-ce-cli containerd.io

    The Docker installation includes the Docker service (daemon), which allows you to start containers, and the Docker CLI client. The Docker CLI uses the Docker API to interact with the Docker service.

  7. Finally, add your user to the docker group with the following command. Group changes only take effect in a new login session; in our case this command alone did not resolve the permission errors, and the workaround in step 8 was needed.

    sudo usermod -aG docker ${USER}
  8. Run nf-core/rnaseq under Docker:

    sudo chmod 666 /var/run/docker.sock
    xxxxx@xxx:~/DATA/Data_Manja_RNAseq_Organoids$ nextflow run nf-core/rnaseq --input samplesheet.csv --outdir results_GRCh38 --with_umi --umitools_bc_pattern 'NNNNNNNNNNNN' --fasta "/home/jhuang/REFs/Homo_sapiens/Ensembl/GRCh38/Sequence/WholeGenomeFasta/genome.fa" --gtf "/home/jhuang/REFs/Homo_sapiens/Ensembl/GRCh38/Annotation/Genes/genes.gtf"  --skip_rseqc --skip_dupradar --skip_preseq -profile docker -resume
  9. During the installation of Singularity, we got the following output, which looks like a warning but is only informational:

    sudo apt-get update && sudo apt-get install -y \
    build-essential \
    libssl-dev \
    uuid-dev \
    libgpgme11-dev \
    squashfs-tools \
    libseccomp-dev \
    wget \
    pkg-config \
    git \
    cryptsetup
    ...
    Processing triggers for initramfs-tools (0.130ubuntu3.6) ...
    update-initramfs: Generating /boot/initrd.img-5.4.0-150-generic
    I: The initramfs will attempt to resume from /dev/dm-9
    I: (UUID=698e2e03-7762-4764-b890-86d234beb938)
    I: Set the RESUME variable to override this.

    These messages relate to resuming the system from suspend-to-disk (hibernation). The line “The initramfs will attempt to resume from /dev/dm-9” means that when the system wakes from hibernation, it will try to restore its state from the device /dev/dm-9. This setting is typically safe and should not cause any problems.
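
For steps 7 and 8, it can help to check whether the current session can already use the Docker socket without sudo. The following is a minimal Python sketch under simple assumptions: the helper name `can_use_socket` is hypothetical, it only looks at group-write and world-write permission bits (it ignores the owning user and ACLs), and the socket path is the one used in step 8.

```python
import os
import stat


def can_use_socket(mode, owner_gid, process_gids):
    """Return True if a socket with this mode and owning group is writable.

    Only two cases are checked (a simplification): the process belongs to
    the owning group and the group-write bit is set, or the world-write
    bit is set (as after 'chmod 666').
    """
    if owner_gid in process_gids and mode & stat.S_IWGRP:
        return True
    return bool(mode & stat.S_IWOTH)


if __name__ == "__main__":
    path = "/var/run/docker.sock"  # path taken from step 8
    if os.path.exists(path):
        info = os.stat(path)
        # Groups of the *current* session; a fresh 'usermod -aG docker'
        # does not show up here until you log in again.
        gids = set(os.getgroups()) | {os.getegid()}
        usable = can_use_socket(info.st_mode, info.st_gid, gids)
        print("Docker socket usable without sudo:", usable)
    else:
        print(path, "not found; is the Docker daemon running?")
```

If this prints False right after step 7, logging out and back in (so the new group membership is picked up) may be enough, which would avoid making the socket world-writable.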

PICRUSt2 Pipeline for Functional Prediction and Pathway Analysis in Metagenomics


This post describes how to run PICRUSt2 (Phylogenetic Investigation of Communities by Reconstruction of Unobserved States, version 2) and how to visualize its output using STAMP (Statistical Analysis of Metagenomic Profiles) and ALDEx2 (Analysis of Differential Abundance taking sample Variation into Account, version 2) for functional prediction and pathway analysis in metagenomics.

  1. Environment Setup: Create a Conda environment named picrust2 with conda create, then activate it with conda activate picrust2.

    #https://github.com/picrust/picrust2/wiki/PICRUSt2-Tutorial-(v2.2.0-beta)#minimum-requirements-to-run-full-tutorial
    conda create -n picrust2 -c bioconda -c conda-forge picrust2  #=2.2.0_b
    conda activate picrust2
  2. Data Preparation: Create a new directory called picrust2_out and enter it with mkdir and cd. Then identify the input files needed for the analysis: metadata.tsv, seqs.fna, and table.biom. The biom commands inspect the BIOM table and convert it to TSV.

    mkdir picrust2_out
    cd picrust2_out
    
    # Identifying input data
    # Note: Replace the paths and filenames with your actual data if different
    # metadata.tsv == ../map_corrected.txt
    # seqs.fna     == ../clustering/seqs.fna
    # table.biom   == ../core_diversity_e42369/table_even42369.biom
    
    # Inspect and convert the BIOM format files
    biom head -i ../core_diversity_e42369/table_even42369.biom
    biom summarize-table -i ../core_diversity_e42369/table_even42369.biom
    biom convert -i ../core_diversity_e42369/table_even42369.biom -o table_even42369.tsv --to-tsv
  3. Running PICRUSt2: The place_seqs.py command places the input sequences into a reference tree. The hsp.py commands generate hidden-state predictions for multiple functional categories.

    #insert reads into reference tree using EPA-NG
    cp ../clustering/rep_set.fna ./
    grep ">" rep_set.fna | wc -l  #44238
    vim table_even42369.tsv       #40596-2
    
    samtools faidx rep_set.fna
    cut -f1-1 table_even42369.tsv > table_even42369.id
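    #manually modify table_even42369.id by replacing "\n" with " >> seqs.fna\nsamtools faidx rep_set.fna ", so that each line becomes "samtools faidx rep_set.fna <ID> >> seqs.fna"
    #then execute the edited file as a shell script
    bash table_even42369.id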
    
    rm -rf intermediate/
    place_seqs.py -s seqs.fna -o out.tre -p 4 --intermediate intermediate/place_seqs
    
    #castor: Efficient Phylogenetics on Large Trees
    #https://github.com/picrust/picrust2/wiki/Hidden-state-prediction
    
    hsp.py -i 16S -t out.tre -o 16S_predicted_and_nsti.tsv.gz -p 15 -n
    hsp.py -i COG -t out.tre -o COG_predicted.tsv.gz -p 15
    hsp.py -i PFAM -t out.tre -o PFAM_predicted.tsv.gz -p 15
    hsp.py -i KO -t out.tre -o KO_predicted.tsv.gz -p 15
    hsp.py -i EC -t out.tre -o EC_predicted.tsv.gz -p 15
    hsp.py -i TIGRFAM -t out.tre -o TIGRFAM_predicted.tsv.gz -p 15
    hsp.py -i PHENO -t out.tre -o PHENO_predicted.tsv.gz -p 15

    The table below (EC_predicted.tsv.gz) shows the predicted copy number of every Enzyme Commission (EC) number for each ASV. The NSTI values per ASV are not in this table because the -n option was not specified for this command. EC numbers are a type of gene family defined by the chemical reactions they catalyze. For instance, EC:1.1.1.1 corresponds to alcohol dehydrogenase. In this tutorial we focus on EC numbers because they can be used to infer MetaCyc pathway levels (see below).

    zless -S EC_predicted.tsv.gz
    sequence        EC:1.1.1.1      EC:1.1.1.10     EC:1.1.1.100    ...
    20e568023c10eaac834f1c110aacea18        2       0       3    ...
    23fe12a325dfefcdb23447f43b6b896e        0       0       1    ...
    288c8176059111c4c7fdfb0cd5afce64        1       0       1    ...
    ...
    
    ##Why import the tsv file to MyData?
    #MyData <- read.csv(file="./COG_predicted.tsv", header=TRUE, sep="\t", row.names=1)   #6806 4598  e.g. COG5665
    #MyData <- read.csv(file="./PFAM_predicted.tsv", header=TRUE, sep="\t", row.names=1)  #6806 11089 e.g. PF17225
    #MyData <- read.csv(file="./KO_predicted.tsv", header=TRUE, sep="\t", row.names=1)    #6806 10543 e.g. K19791
    #MyData <- read.csv(file="./EC_predicted.tsv", header=TRUE, sep="\t", row.names=1)    #6806 2913  e.g. EC.6.6.1.2
    #MyData <- read.csv(file="./16S_predicted.tsv", header=TRUE, sep="\t", row.names=1)   #6806    1     e.g. X16S_rRNA_Count
    #MyData <- read.csv(file="./TIGRFAM_predicted.tsv", header=TRUE, sep="\t", row.names=1)  #6806 4287  e.g. TIGR04571
    #MyData <- read.csv(file="./PHENO_predicted.tsv", header=TRUE, sep="\t", row.names=1)    #6806   41  e.g. Use_of_nitrate_as_electron_acceptor, Xylose_utilizing
  4. The metagenome_pipeline.py commands perform metagenome prediction for several functional categories. The predicted gene families are weighted by the relative abundance of ASVs in their community; in other words, we infer the metagenomes of the communities.

    #Generate metagenome predictions using EC numbers https://en.wikipedia.org/wiki/List_of_enzymes#Category:EC_1.1_(act_on_the_CH-OH_group_of_donors)
    metagenome_pipeline.py -i ../core_diversity_e42369/table_even42369.biom -m 16S_predicted_and_nsti.tsv.gz -f COG_predicted.tsv.gz -o COG_metagenome_out --strat_out
    metagenome_pipeline.py -i ../core_diversity_e42369/table_even42369.biom -m 16S_predicted_and_nsti.tsv.gz -f EC_predicted.tsv.gz -o EC_metagenome_out --strat_out
    metagenome_pipeline.py -i ../core_diversity_e42369/table_even42369.biom -m 16S_predicted_and_nsti.tsv.gz -f KO_predicted.tsv.gz -o KO_metagenome_out --strat_out
    metagenome_pipeline.py -i ../core_diversity_e42369/table_even42369.biom -m 16S_predicted_and_nsti.tsv.gz -f PFAM_predicted.tsv.gz -o PFAM_metagenome_out --strat_out
    metagenome_pipeline.py -i ../core_diversity_e42369/table_even42369.biom -m 16S_predicted_and_nsti.tsv.gz -f TIGRFAM_predicted.tsv.gz -o TIGRFAM_metagenome_out --strat_out
  5. Pathway-level inference: By default this script infers MetaCyc pathway abundances based on EC number abundances, although different gene families and pathways can also be optionally specified. This script performs a number of steps by default, which are based on the approach implemented in HUMAnN2:

    • Regroups EC numbers to MetaCyc reactions.
    • Infers which MetaCyc pathways are present based on these reactions with MinPath.
    • Calculates and returns the abundance of pathways identified as present.

      #pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_contrib.tsv.gz -o pathways_out -p 15
      
      #Note that the path of map files is under /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/pathway_mapfiles
      (picrust2) pathway_pipeline.py -i COG_metagenome_out/pred_metagenome_contrib.tsv.gz -o KEGG_pathways_out -p 15 --no_regroup --map /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/pathway_mapfiles/KEGG_pathways_to_KO.tsv
      
      #Mapping predicted KO abundances to legacy KEGG pathways (with stratified output that represents contributions to community-wide abundances):
      (picrust2) pathway_pipeline.py -i KO_metagenome_out/pred_metagenome_strat.tsv.gz -o KEGG_pathways_out --no_regroup --map /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/pathway_mapfiles/KEGG_pathways_to_KO.tsv
      
      #Map EC numbers to MetaCyc pathways and get stratified output corresponding to contribution of predicted gene family abundances within each predicted genome:
      #BUG: CANNOT FINISH in 1 day! (picrust2) pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -o pathways_out_per_seq --per_sequence_contrib --per_sequence_abun EC_metagenome_out/seqtab_norm.tsv.gz --per_sequence_function EC_predicted.tsv.gz
      (picrust2) pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -o pathways_out -p 6
  6. Add functional descriptions: Finally, it can be useful to have a description of each functional ID in the output abundance tables. The commands below add these descriptions as a new column in the gene family and pathway abundance tables.

    #--6.1. Add descriptions in gene family tables
    add_descriptions.py -i COG_metagenome_out/pred_metagenome_unstrat.tsv.gz -m COG -o COG_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
    add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
    add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz   # EC and METACYC is a pair, EC for gene_annotation and METACYC for pathway_annotation
    add_descriptions.py -i PFAM_metagenome_out/pred_metagenome_unstrat.tsv.gz -m PFAM -o PFAM_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
    add_descriptions.py -i TIGRFAM_metagenome_out/pred_metagenome_unstrat.tsv.gz -m TIGRFAM -o TIGRFAM_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz
    
    #--6.2. Add descriptions in pathway abundance tables
    add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC -o pathways_out/path_abun_unstrat_descrip.tsv.gz
    
    #Error - no rows remain after regrouping input table. The default pathway and regroup mapfiles are meant for EC numbers. Note that KEGG pathways are not supported since KEGG is a closed-source database, but you can input custom pathway mapfiles if you have access. If you are using a custom function database did you mean to set the --no-regroup flag and/or change the default pathways mapfile used?
    #If ERROR --> USE the METACYC for downstream analyses!!!
    
    add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -o KEGG_pathways_out/path_abun_unstrat_descrip.tsv.gz --custom_map_table /home/jhuang/anaconda3/envs/picrust2/lib/python3.6/site-packages/picrust2/default_files/description_mapfiles/KEGG_pathways_info.tsv.gz
  7. Visualization

    #7.1 install and open STAMP
    #https://github.com/picrust/picrust2/wiki/STAMP-example
    #install and open STAMP
    conda deactivate
    conda install -c bioconda stamp
    #sudo pip install pyqi
    #sudo apt-get install libblas-dev liblapack-dev gfortran
    #sudo apt-get install freetype* python-pip python-dev python-numpy python-scipy python-matplotlib
    #sudo pip install STAMP
    #conda install -c bioconda stamp
    conda create -n stamp -c bioconda/label/cf201901 stamp
    brew install pyqt
    #DEBUG the environment
    conda install pyqt=4
    #conda install icu=56
    (stamp) jhuang@hamburg:~$ STAMP
    
    #7.2 unzip path_abun_unstrat_descrip.tsv.gz
    gunzip path_abun_unstrat_descrip.tsv.gz
    
    #7.3 prepare metadata.tsv from map_corrected.txt
    vim metadata.tsv
    #SampleID   Genotype    Description
    #S1 Before_non-reducers IDFrancesco1
    #S2 After_non-reducers  IDFrancesco2
    #S3 Before_Reducers IDFrancesco4
    #S4 After_Reducers  IDFrancesco5
    cut -d$'\t' -f1 map_corrected.txt > 1
    cut -d$'\t' -f5 map_corrected.txt > 5
    cut -d$'\t' -f6 map_corrected.txt > 6
    paste -d$'\t' 1 5 > 1_5
    paste -d$'\t' 1_5 6 > metadata.tsv
    #SampleID --> SampleID
    SampleID    Facility    Genotype
    100CHE6KO   PaloAlto    KO
    101CHE6WT   PaloAlto    WT
    
    #7.4(optional) use ALDEx2 rather than STAMP: https://bioconductor.org/packages/release/bioc/html/ALDEx2.html
  8. Explanation of the generated plot from Step 7: Extended error bar.

[Figure: error_bar (extended error bar plot)]

The difference in mean proportions is a statistical measurement that is often used in comparing the proportions of a certain outcome between two groups.

Here's a simple example to explain the concept:

Imagine you conducted a survey on two groups of people, Group A and Group B, asking whether they like a specific brand of chocolate. In Group A, 70 out of 100 people said yes (proportion = 0.7). In Group B, 80 out of 100 people said yes (proportion = 0.8). The difference in proportions is 0.8 - 0.7 = 0.1. This means that in your sample, the proportion of people who like the specific brand of chocolate is 10 percentage points higher in Group B than in Group A.

Statistically speaking, we often want to know whether this difference is significant (i.e., is it likely to be due to chance, or is there a real difference between the groups?). We can use a statistical test, such as a two-proportion z-test, to answer this question.

Note that whether a given difference in proportions is statistically significant depends strongly on the size of your sample. With very large groups, even a small difference in proportions can be statistically significant. With small groups, only a large difference will be.
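
To make the chocolate survey example concrete, here is a minimal sketch of a pooled two-proportion z-test using only the Python standard library. The function name `two_proportion_z_test` is illustrative, and this is the textbook normal-approximation test, not necessarily the test STAMP applies to produce the extended error bar plot.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test (normal approximation).

    Returns (difference in proportions, z statistic, two-sided p-value).
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis of equal proportions
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Survey example from the text: 70/100 vs. 80/100
diff, z, p = two_proportion_z_test(70, 100, 80, 100)
print(f"n=100 per group:  diff={diff:.2f}, z={z:.3f}, p={p:.3f}")

# Same proportions with ten times more respondents per group
diff_l, z_l, p_l = two_proportion_z_test(700, 1000, 800, 1000)
print(f"n=1000 per group: diff={diff_l:.2f}, z={z_l:.3f}, p={p_l:.2e}")
```

With 100 people per group the 0.1 difference gives a two-sided p-value of about 0.10, which is not significant at the 0.05 level, while the same difference with 1,000 people per group is highly significant. This illustrates the dependence on sample size described above.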

Evaluating the Proximity of Genomic Features (Integration Sites) to Peaks Using a Permutation Test

This analysis evaluates the proximity of integration sites to peaks in the human genome using a permutation-based approach. It involves three primary steps:

  1. Observed Data: We begin by considering our observed data, which are the distances from each integration site to the nearest peak. We compute the mean of these observed distances.

  2. Null Distribution: To generate a null distribution, we perform a permutation test by randomizing the integration sites. For each iteration (here 1,000 iterations, to obtain a robust distribution), we randomly place as many integration sites in the human genome as there are observed distances and calculate the distance from each to the nearest peak. The random integration sites, termed ‘random_integration_sites’, are represented in BED format (chromosome, start, end), and their positions are drawn using a table of human chromosome lengths.

    The calculation of distances proceeds as follows:

    • For each feature (randomly generated integration site), we find the closest peak using the ‘closest’ function from the ‘pybedtools’ library, resulting in a ‘random_closest’ BEDTool object.
    • We extract the distances from each random integration site to its closest peak. These distances are stored in ‘random_distances’.
    • We then calculate the mean of these random distances and store it in ‘random_mean_distances’.
  3. Comparison: We then compare our observed mean distance to the null distribution of mean distances. The output p-value serves as an indicator of statistical significance. If the p-value is small (conventionally, less than 0.05 is considered significant), it suggests that the observed mean distance is significantly different from what would be expected by random chance, indicating that the features are not randomly distributed but are likely to be located closer to peaks in the genome.

    Observed mean distance: 1218081
    
    The statistical results from the 1000 iterations of the permutation test, in which we randomly generated potential integration sites:
    Mean:  1445755
    Standard deviation:  400673
    Minimum:  509078
    Maximum:  2978797
    P-value:  0.313

(Optional) If your observed mean distance is smaller than the smallest mean distance from your null distribution (i.e., the 1000 permutations), this indeed suggests that your observed integration sites are significantly closer to the peaks than would be expected under the null hypothesis.

(Optional) In other words, your observed integration sites are more closely located to the peaks than random integration sites, which supports the conclusion that there is a non-random association between your integration sites and the peaks.

(Optional) A small p-value suggests that the observed mean distance is significantly smaller than what is expected by random chance. This could mean that the integration sites under study are preferentially located near peaks in the genome. The null distribution is the collection of mean distances we calculated for each permutation. The proportion of mean distances in the null distribution that is less than or equal to our observed mean distance serves as our p-value.

So the p-value calculation should look something like this:

#pip install statsmodels

import numpy as np
from pybedtools import BedTool
import pprint
from statsmodels.stats.multitest import multipletests
pp = pprint.PrettyPrinter(indent=4)

#sort -k1,1 -k2,2n peaks_on_integrationsites.csv > peaks_on_integrationsites_sorted.bed
#=898046
#1406936,133333333

# Observed distances
#observed_distances = [-4045231,563541,1118767,-1779287,0,-5347653,3935720,1146367,1507718,0,-1826,-7456,81323,68056,1386933,0,-545651,-84468,-652642,351958,218160,5644455,320101,2050624,-418508,-1061416,-351892,-33175,-296551,-138858,2221723,-658351,3419047,-2701162,1295321,4712290,0,1434626,-5479512,1918341,465313,-986431,190096,-566869,-736100,3579169,1087322,-2696342,-1866390,-14123,1250899,-1424025,-929436,232285,232338,3962087,1042645,728148,-163988,-188515,-1445728,-198270,-116532,267672,924015,735666,-1705528,147724,-122133,261167]
observed_distances = [4045231,563541,1118767,1779287,0,5347653,3935720,1146367,1507718,0,1826,7456,81323,68056,1386933,0,545651,84468,652642,351958,218160,5644455,320101,2050624,418508,1061416,351892,33175,296551,138858,2221723,658351,3419047,2701162,1295321,4712290,0,1434626,5479512,1918341,465313,986431,190096,566869,736100,3579169,1087322,2696342,1866390,14123,1250899,1424025,929436,232285,232338,3962087,1042645,728148,163988,188515,1445728,198270,116532,267672,924015,735666,1705528,147724,122133,261167]
#unique_observed_distances = list(set(observed_distances))
observed_mean = np.mean(observed_distances)
print('observed_mean:', observed_mean)

# Load peak ranges from the BED file
peaks = BedTool('peaks_NHDF_.bed').sort()

# Define chrom regions
# 'chrM': 16569, 
#175187.58208955225
chrom_regions = {
    'chr1': 248956422, 'chr2': 242193529, 'chr3': 198295559, 'chr4': 190214555, 'chr5': 181538259,
    'chr6': 170805979, 'chr7': 159345973, 'chr8': 145138636, 'chr9': 138394717, 'chr10': 133797422,
    'chr11': 135086622, 'chr12': 133275309, 'chr13': 114364328, 'chr15': 101991189,
    'chr16': 90338345, 'chr17': 83257441, 'chr18': 80373285, 'chr20': 64444167,
    'chr21': 46709983, 'chr22': 50818468, 'chr14': 107043718, 'chr19': 58617616, 'chrX': 156040895, 'chrY': 57227415
}

# Permutation test parameters  16.4
#620708 --> 42208198/47=898046 --> 1406936
num_permutations = 1000
num_features = len(observed_distances)

random_mean_distances = []
for _ in range(num_permutations):
    # Generate random integration sites
    random_integration_sites = []
    for _ in range(num_features):
        chrom = np.random.choice(list(chrom_regions.keys()))
        position = np.random.randint(0, chrom_regions[chrom])
        # each random site gets a fixed length of 898046 (end - start), used here as the average length of the integration sites in the genome
        random_integration_sites.append((chrom, position, position+898046))
    random_integration_sites = BedTool(random_integration_sites)
    random_integration_sites = random_integration_sites.sort()
    #print(random_integration_sites)

    # Find closest peaks for each feature
    random_closest = random_integration_sites.closest(peaks, d=True)

    # Extract distances
    random_distances = [int(i[-1]) for i in random_closest]

    # Calculate distances and their mean
    random_mean_distances.append(np.mean(random_distances))

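# Calculate the p-value once, after all permutations are done: the proportion of
# random means that are less than or equal to the observed mean (one-sided test)
p_values = [np.mean([mean_dist <= observed_mean for mean_dist in random_mean_distances])]
# Note: with a single p-value, the BH correction returns the same value
p_values_corrected = multipletests(p_values, method='fdr_bh')[1]  # Apply BH correction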

#pp.pprint(random_mean_distances)
print("Mean: ", np.mean(random_mean_distances))
print("Standard deviation: ", np.std(random_mean_distances))
print("Minimum: ", np.min(random_mean_distances))
print("Maximum: ", np.max(random_mean_distances))
print("Uncorrected p-value: ", p_values[0])
print("Corrected p-value: ", p_values_corrected[0])  # After BH correction

      #[mean_dist <= observed_mean for mean_dist in random_mean_distances]
      #This is a list comprehension that returns a boolean list. For each mean distance in random_mean_distances (i.e., the mean distances calculated from randomly selected genomic positions), it checks whether that mean distance is less than or equal to observed_mean (i.e., the mean of the observed distances from each integration site to its nearest peak). If the condition is met, it yields True (equivalent to 1); otherwise it yields False (equivalent to 0).
      #np.mean([...])
      #This calculates the mean of the boolean list, which equals the proportion of random mean distances that are less than or equal to the observed mean distance, because True counts as 1 and False as 0. This proportion is used as the p-value.
      #The p-value is the probability, under the null hypothesis, of obtaining a result at least as extreme as the one observed. The lower the p-value, the stronger the evidence that the null hypothesis does not adequately explain the observation. In most cases a threshold (alpha) of 0.05 is used, meaning we accept a 5% chance of rejecting the null hypothesis when it is actually true. If the p-value is less than 0.05, the null hypothesis is rejected.
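
The one-sided p-value computed in the script (the proportion of permuted means at or below the observed mean, matching the `<=` in the code) can be isolated into a small helper. This is a minimal sketch with toy numbers; the function name `empirical_p_value` and the optional `add_one` variant, which avoids reporting a p-value of exactly 0, are assumptions for illustration and are not part of the original script.

```python
def empirical_p_value(null_means, observed_mean, add_one=False):
    """One-sided empirical p-value for 'observed is smaller than random'.

    Counts how many null (permuted) means are <= the observed mean.
    With add_one=True the observed value is counted as one extra
    permutation: (count + 1) / (n + 1).
    """
    count = sum(1 for mean_dist in null_means if mean_dist <= observed_mean)
    n = len(null_means)
    if add_one:
        return (count + 1) / (n + 1)
    return count / n


# Toy null distribution of five permuted mean distances (made-up values)
toy_null = [900, 1100, 1200, 1500, 2000]
print(empirical_p_value(toy_null, 1150))                # 2 of 5 are <= 1150
print(empirical_p_value(toy_null, 1150, add_one=True))  # (2 + 1) / (5 + 1)
```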

I performed a statistical test aimed at evaluating whether the observed distances from integration sites to the nearest peaks are significantly different from what might be expected by random chance.

Here are the results:

  • Observed mean distance (calculated from a total of 70 integration sites to their nearest peaks): 1,218,081 nt.

The statistical results from the 1000 iterations of the permutation test are as follows:

  • Mean: 1,445,755
  • Standard deviation: 400,673
  • Minimum: 509,078
  • Maximum: 2,978,797

The P-value is 0.313. It suggests that our observed integration sites are not significantly closer to these peaks than would be expected if the sites were distributed randomly.

The statistical test involves three steps:

  1. Observed Data: We start by considering our observed data, which are the distances from each integration site to the nearest peak. We compute the mean of these observed distances.
  2. Null Distribution: To generate a null distribution, we perform a permutation test with 1000 iterations. For each iteration, we randomly generate 70 positions (as potential integration sites) and then calculate the distance from these positions to the nearest peak.
  3. Comparison: We then compare our observed mean distance to the 1000 mean distances generated in the last step.
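
Step 2 leaves some freedom in how the random positions are drawn. The script in this post picks each chromosome with equal probability; one possible variant, shown here only as a sketch and not as the method behind the results above, picks chromosomes in proportion to their length and keeps every interval inside its chromosome. The function name `sample_random_sites` is hypothetical; the chromosome lengths and the fixed site length of 898046 are taken from the script.

```python
import random


def sample_random_sites(chrom_lengths, n_sites, site_length, rng):
    """Draw n_sites random (chrom, start, end) intervals of fixed length.

    Chromosomes are weighted by the number of valid start positions,
    so longer chromosomes receive proportionally more sites, and
    end = start + site_length never exceeds the chromosome length.
    """
    chroms = list(chrom_lengths)
    weights = [chrom_lengths[c] - site_length + 1 for c in chroms]
    if min(weights) <= 0:
        raise ValueError("site_length exceeds the length of a chromosome")
    sites = []
    for _ in range(n_sites):
        chrom = rng.choices(chroms, weights=weights, k=1)[0]
        start = rng.randrange(0, chrom_lengths[chrom] - site_length + 1)
        sites.append((chrom, start, start + site_length))
    return sites


# Subset of the chromosome table from the script, for brevity
lengths = {'chr1': 248956422, 'chr21': 46709983, 'chrY': 57227415}
rng = random.Random(42)  # fixed seed so the draw is reproducible
sites = sample_random_sites(lengths, n_sites=70, site_length=898046, rng=rng)
print(sites[:3])
```

The resulting list of tuples has the same shape as `random_integration_sites` in the script, so it could be passed to `BedTool(...)` in the same way.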

Epidome processing

  1. Raw_Data

    #Here is some more information on the two sample collections for epidome sequencing (9+10+33=52 samples):
    
    #Samples 7N-15N are nose swabs of 9 individual patients, and 16-20F/N are feet (F) and nose (N) swabs of 5 more patients. 
    
    #All of these patients were hospitalized for endoprosthesis surgery. 
    
    #nose swabs of 9 individual patients (9 samples)
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1436/7N_S1_R1_001.fastq.gz P7-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1436/7N_S1_R2_001.fastq.gz P7-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1437/8N_S2_R1_001.fastq.gz P8-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1437/8N_S2_R2_001.fastq.gz P8-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1438/9N_S3_R1_001.fastq.gz P9-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1438/9N_S3_R2_001.fastq.gz P9-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1439/10N_S4_R1_001.fastq.gz P10-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1439/10N_S4_R2_001.fastq.gz P10-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1440/11N_S5_R1_001.fastq.gz P11-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1440/11N_S5_R2_001.fastq.gz P11-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1441/12N_S6_R1_001.fastq.gz P12-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1441/12N_S6_R2_001.fastq.gz P12-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1442/13N_S7_R1_001.fastq.gz P13-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1442/13N_S7_R2_001.fastq.gz P13-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1443/14N_S8_R1_001.fastq.gz P14-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1443/14N_S8_R2_001.fastq.gz P14-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1444/15N_S9_R1_001.fastq.gz P15-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1444/15N_S9_R2_001.fastq.gz P15-Nose_R2.fastq.gz
    
    #16-20F/N are feet (F) and nose (N) swabs of 5 more patients (10 samples)
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1445/16F_S10_R1_001.fastq.gz P16-Foot_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1445/16F_S10_R2_001.fastq.gz P16-Foot_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1446/16N_S11_R1_001.fastq.gz P16-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1446/16N_S11_R2_001.fastq.gz P16-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1447/17F_S12_R1_001.fastq.gz P17-Foot_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1447/17F_S12_R2_001.fastq.gz P17-Foot_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1448/17N_S13_R1_001.fastq.gz P17-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1448/17N_S13_R2_001.fastq.gz P17-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1449/18F_S14_R1_001.fastq.gz P18-Foot_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1449/18F_S14_R2_001.fastq.gz P18-Foot_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1450/18N_S15_R1_001.fastq.gz P18-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1450/18N_S15_R2_001.fastq.gz P18-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1451/19F_S16_R1_001.fastq.gz P19-Foot_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1451/19F_S16_R2_001.fastq.gz P19-Foot_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1452/19N_S17_R1_001.fastq.gz P19-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1452/19N_S17_R2_001.fastq.gz P19-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1453/20F_S18_R1_001.fastq.gz P20-Foot_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1453/20F_S18_R2_001.fastq.gz P20-Foot_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1454/20N_S19_R1_001.fastq.gz P20-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1454/20N_S19_R2_001.fastq.gz P20-Nose_R2.fastq.gz
    
    #Samples 1-108 are swabs of noses, lesioned skin (LH) and non-lesioned skin (NLH) of 11 patients suffering from atopic dermatitis (33 samples).
    #There are more details on these samples in the attached excel file.
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1455/1_S20_R1_001.fastq.gz RP-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1455/1_S20_R2_001.fastq.gz RP-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1456/2_S21_R1_001.fastq.gz RP-LH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1456/2_S21_R2_001.fastq.gz RP-LH_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1457/3_S22_R1_001.fastq.gz RP-NLH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1457/3_S22_R2_001.fastq.gz RP-NLH_R2.fastq.gz
    
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1458/4_S23_R1_001.fastq.gz AL-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1458/4_S23_R2_001.fastq.gz AL-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1459/5_S24_R1_001.fastq.gz AL-LH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1459/5_S24_R2_001.fastq.gz AL-LH_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1460/6_S25_R1_001.fastq.gz AL-NLH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1460/6_S25_R2_001.fastq.gz AL-NLH_R2.fastq.gz
    
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1461/22_S26_R1_001.fastq.gz MC-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1461/22_S26_R2_001.fastq.gz MC-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1462/23_S27_R1_001.fastq.gz MC-LH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1462/23_S27_R2_001.fastq.gz MC-LH_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1463/24_S28_R1_001.fastq.gz MC-NLH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1463/24_S28_R2_001.fastq.gz MC-NLH_R2.fastq.gz
    
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1464/25_S29_R1_001.fastq.gz SA-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1464/25_S29_R2_001.fastq.gz SA-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1465/26_S30_R1_001.fastq.gz SA-LH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1465/26_S30_R2_001.fastq.gz SA-LH_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1466/27_S31_R1_001.fastq.gz SA-NLH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1466/27_S31_R2_001.fastq.gz SA-NLH_R2.fastq.gz
    
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1467/28_S32_R1_001.fastq.gz HR-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1467/28_S32_R2_001.fastq.gz HR-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1468/29_S33_R1_001.fastq.gz HR-LH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1468/29_S33_R2_001.fastq.gz HR-LH_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1469/30_S34_R1_001.fastq.gz HR-NLH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1469/30_S34_R2_001.fastq.gz HR-NLH_R2.fastq.gz
    
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1470/34_S35_R1_001.fastq.gz XN-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1470/34_S35_R2_001.fastq.gz XN-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1471/35_S36_R1_001.fastq.gz XN-LH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1471/35_S36_R2_001.fastq.gz XN-LH_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1472/36_S37_R1_001.fastq.gz XN-NLH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1472/36_S37_R2_001.fastq.gz XN-NLH_R2.fastq.gz
    
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1487/AY_S52_R1_001.fastq.gz MR-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1487/AY_S52_R2_001.fastq.gz MR-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1473/50_S38_R1_001.fastq.gz MR-LH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1473/50_S38_R2_001.fastq.gz MR-LH_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1474/51_S39_R1_001.fastq.gz MR-NLH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1474/51_S39_R2_001.fastq.gz MR-NLH_R2.fastq.gz
    
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1475/58_S40_R1_001.fastq.gz CB-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1475/58_S40_R2_001.fastq.gz CB-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1476/59_S41_R1_001.fastq.gz CB-LH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1476/59_S41_R2_001.fastq.gz CB-LH_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1477/60_S42_R1_001.fastq.gz CB-NLH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1477/60_S42_R2_001.fastq.gz CB-NLH_R2.fastq.gz
    
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1478/94_S43_R1_001.fastq.gz KK-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1478/94_S43_R2_001.fastq.gz KK-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1479/95_S44_R1_001.fastq.gz KK-LH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1479/95_S44_R2_001.fastq.gz KK-LH_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1480/96_S45_R1_001.fastq.gz KK-NLH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1480/96_S45_R2_001.fastq.gz KK-NLH_R2.fastq.gz
    
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1481/103_S46_R1_001.fastq.gz AH-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1481/103_S46_R2_001.fastq.gz AH-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1482/104_S47_R1_001.fastq.gz AH-LH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1482/104_S47_R2_001.fastq.gz AH-LH_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1483/105_S48_R1_001.fastq.gz AH-NLH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1483/105_S48_R2_001.fastq.gz AH-NLH_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1484/106_S49_R1_001.fastq.gz PT2-Nose_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1484/106_S49_R2_001.fastq.gz PT2-Nose_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1485/107_S50_R1_001.fastq.gz PT2-LH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1485/107_S50_R2_001.fastq.gz PT2-LH_R2.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1486/108_S51_R1_001.fastq.gz PT2-NLH_R1.fastq.gz
    ln -s ./230425_M03701_0301_000000000-KNW3N/hr1486/108_S51_R2_001.fastq.gz PT2-NLH_R2.fastq.gz
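
The repetitive `ln -s` lines above could also be generated from a small mapping table. This is only a sketch under stated assumptions: the three-column table format (run sub-directory, file prefix, new sample name) and the function name `link_pairs` are made up for illustration; the run directory and the sample names are taken from the commands above.

```shell
# Sketch: create the R1/R2 symlinks from a mapping table instead of typing each ln -s.
# Assumption: each table line is "<subdir> <file-prefix> <sample-name>" (hypothetical format).
RUN=./230425_M03701_0301_000000000-KNW3N

link_pairs() {
    # Reads mapping lines from stdin; empty lines and lines starting with "#" are skipped.
    local subdir prefix sample mate
    while read -r subdir prefix sample; do
        [ -z "$subdir" ] && continue
        case "$subdir" in \#*) continue ;; esac
        for mate in R1 R2; do
            ln -s "$RUN/$subdir/${prefix}_${mate}_001.fastq.gz" "${sample}_${mate}.fastq.gz"
        done
    done
}

# Usage (run inside raw_data), shown for three of the samples listed above:
# link_pairs <<'EOF'
# hr1455 1_S20 RP-Nose
# hr1456 2_S21 RP-LH
# hr1457 3_S22 RP-NLH
# EOF
```

Keeping the mapping in one table makes it easier to spot a wrong sample name than scanning 80 near-identical lines.
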
  2. Processing for Epidome yycH and g216

    #-->92-301
    #yycH: 476 + 44      -->minLen=200 (76)
    #g216: 448 + 44      -->minLen=185 (78)
    #16S: 410-427 + 44   -->minLen=170 (70)
    #360/2=180 *
    #200 and 200 *
    #>seq1;
    #TGGGTATGGCAATCACTTTACA
    #AGAATTCTATATTAAAGATGTTCTAATTGTGGAAAAGGGATCCATCGGTCATTCATTTAAACATTGGCCTCTATCAACAAAGACCATCACACCATCATTTACAACTAATGGTTTTGGCATGCCAGATATGAATGCAATAGCTAAAGATACATCACCTGCCTTCACTTTCAATGAAGAACATTTGTCTGGAAATAATTACGCTCAATACATTTCATTAGTAGCTGAGCATTACAATCTAAATGTCAAAACAAATACCAATGTTTCACGTGTAACATACATAGATGGTATATATCATGTATCAACGGACTATGGTGTTTATACCGCAGATTATATATTTATAGCAACTGGAGACTATTCATTCCCATATCATCCTTTTTCATATGGACGTCATTACAGTGAGATTCGAGCGTTCACTCAATTAAACGGTGACGCCTTTACAATTATTGGA GGTAATGAGAGTGCTTTTGATGC
    
    #>M03701:292:000000000-K9M88:1:1101:10277:1358:AAGAGGCA+TATCCTCT
    #TGGGTATGRCAATCACTTTACA
    #AGAATTCAATATTAAAGATGTTCTAATTGTTGAAAAGGGAACCATCGGTCATTCATTTAAACATTGGCCTCTATCAACAAAGACCATCACACCATCATTTACAACTAATGGTTTTGGCATGCCAGATATGAATGCAATAGCTAAAGATACATCACCTGCCTTCACTTTCAATGAAGAACATTTATCTGGAAAACGTTATGCTGAATACCTCTCACTAGTAGCTACGCATTACAATCTAAATGGCAAAACAAACACCAATGTTTCACGTGTAACATACATAGATGGTGTATATCATGTATCAACGGACTATGGTGTTTATACCGCAGATTATATATTTATAGCAACTGGAGACTATTCATTCCCATATCATCCTTTATCATATGGACGTCATTACAGTGAAATTCAAACATTCACTCAATTAAAAGGTGATGCTTTTACAATCATTGGT GGTAATGAGAGTGCTTTTGATGC
    
    #DIR: ~/DATA/Data_Holger_Epidome/testrun2
    #Input: epidome->/home/jhuang/Tools/epidome and rawdata
    
    #Read in 37158 paired-sequences, output 31225 (84%) filtered paired-sequences.
    #Read in 82145 paired-sequences, output 78594 (95.7%) filtered paired-sequences.
    #-->
    #Overwriting file:/home/jhuang/DATA/Data_Holger_Epidome_myData2/cutadapted_yycH/filtered_R1/A10-1_R1.fastq.gz
    #Overwriting file:/home/jhuang/DATA/Data_Holger_Epidome_myData2/cutadapted_yycH/filtered_R2/A10-1_R2.fastq.gz
    #Read in 37158 paired-sequences, output 35498 (95.5%) filtered paired-sequences.
    #Overwriting file:/home/jhuang/DATA/Data_Holger_Epidome_myData2/cutadapted_yycH/filtered_R1/A10-2_R1.fastq.gz
    #Overwriting file:/home/jhuang/DATA/Data_Holger_Epidome_myData2/cutadapted_yycH/filtered_R2/A10-2_R2.fastq.gz
    #Read in 82145 paired-sequences, output 80918 (98.5%) filtered paired-sequences.
    #
    #Read in 46149 paired-sequences, output 22206 (48.1%) filtered paired-sequences.
    #Read in 197875 paired-sequences, output 168942 (85.4%) filtered paired-sequences.
    #Read in 230646 paired-sequences, output 201376 (87.3%) filtered paired-sequences.
    #Read in 175759 paired-sequences, output 149823 (85.2%) filtered paired-sequences.
    #Read in 147546 paired-sequences, output 128864 (87.3%) filtered paired-sequences.
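
The log lines above report how many read pairs survive filtering, and one sample keeps only 48.1%. To spot such samples without reading the whole log, something like the following could be used. It is a rough sketch: the function name `flag_low_retention` and the 50% cut-off are arbitrary choices for illustration, and the line format is assumed to be exactly the one shown above.

```shell
# Sketch: recompute the retained fraction from the log lines and flag low values.
# Assumption: lines look like
#   "#Read in N paired-sequences, output M (P%) filtered paired-sequences."
flag_low_retention() {
    # $1 = minimum acceptable percentage (hypothetical cut-off, default 50)
    awk -v min="${1:-50}" '
        /Read in [0-9]+ paired-sequences, output [0-9]+/ {
            for (i = 1; i < NF; i++) {
                if ($i == "in")     n_in  = $(i + 1)
                if ($i == "output") n_out = $(i + 1)
            }
            if (n_in > 0) {
                pct = 100 * n_out / n_in
                if (pct < min) printf "LOW\t%d\t%d\t%.1f\n", n_in, n_out, pct
            }
        }'
}

# Usage: flag_low_retention 50 < yycH_runwise_pipeline_.LOG
```
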

2.1. quality controls (optional)

    #under testrun2 should have
    #BiocManager::install("dada2")
    library(dada2); packageVersion("dada2")
    path <- "/home/jhuang/DATA/Data_Luise_Epidome_batch3/raw_data" # CHANGE ME to the directory containing the fastq files after unzipping.
    list.files(path)

    # Forward and reverse fastq filenames have format: SAMPLENAME_R1.fastq.gz and SAMPLENAME_R2.fastq.gz
    fnFs <- sort(list.files(path, pattern="_R1\\.fastq\\.gz$", full.names = TRUE))
    fnRs <- sort(list.files(path, pattern="_R2\\.fastq\\.gz$", full.names = TRUE))
    # Extract sample names (the text before the first "_"), assuming filenames have format: SAMPLENAME_XXX.fastq.gz
    sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)
    # print() makes sure the plot is drawn even when this code is run via source()
    png("quality_fnFs.png", width=800, height=800)
    print(plotQualityProfile(fnFs[1:2]))
    dev.off()
    png("quality_fnRs.png", width=800, height=800)
    print(plotQualityProfile(fnRs[1:2]))
    dev.off()
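
Because fnFs and fnRs come from two independent sorted file listings, a missing R2 file would silently shift the pairing. A quick check in the shell before starting R might look like this. It is a minimal sketch; the function name `check_pairs` is made up, and it assumes the `*_R1.fastq.gz` / `*_R2.fastq.gz` naming used above.

```shell
# Sketch: verify that every forward file has a reverse mate.
check_pairs() {
    # $1 = directory with *_R1.fastq.gz / *_R2.fastq.gz files.
    # Prints each missing mate; exit status is 0 only if nothing is missing.
    local dir="$1" fwd rev missing=0
    for fwd in "$dir"/*_R1.fastq.gz; do
        [ -e "$fwd" ] || continue          # no match: the glob stays literal
        rev="${fwd%_R1.fastq.gz}_R2.fastq.gz"
        if [ ! -e "$rev" ]; then
            echo "missing: $rev"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Usage: check_pairs /home/jhuang/DATA/Data_Luise_Epidome_batch3/raw_data && echo "all pairs complete"
```
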

2.2. cutadapt instead of Trimmomatic (namely demultiplexing, see epidome/scripts/EPIDOME_yycH_cutadapt_loop.sh)

    #Output: cutadapted_yycH cutadapted_g216 cutadapted_16S
    #Script: epidome/scripts/EPIDOME_yycH_cutadapt_loop.sh

    #5′-CGATGCKAAAGTGCCGAATA-3′/5′-CTTCATTTAAGAAGCCACCWTGACT-3′  for yycH
    #5′-TGGGTATGRCAATCACTTTACA-3′/5′-GCATCAAAAGCACTCTCATTACC-3′  for g216
    #-p CCTACGGGNGGCWGCAG -q GACTACHVGGGTATCTAATCC -l 300        for 16S
    mkdir cutadapted_yycH cutadapted_g216 cutadapted_16S
    cd raw_data
    #The default is --action=trim. With --action=retain, the read is trimmed, but the adapter sequence itself is not removed.
    for file in *_R1.fastq.gz; do
        cutadapt -e 0.06 -g CGATGCKAAAGTGCCGAATA -G CTTCATTTAAGAAGCCACCWTGACT --pair-filter=any -o "../cutadapted_yycH/${file}" --paired-output "../cutadapted_yycH/${file/R1.fastq.gz/R2.fastq.gz}" --discard-untrimmed "$file" "${file/R1.fastq.gz/R2.fastq.gz}"
    done
    for file in *_R1.fastq.gz; do
        cutadapt -e 0.06 -g TGGGTATGRCAATCACTTTACA -G GCATCAAAAGCACTCTCATTACC --pair-filter=any -o "../cutadapted_g216/${file}" --paired-output "../cutadapted_g216/${file/R1.fastq.gz/R2.fastq.gz}" --discard-untrimmed "$file" "${file/R1.fastq.gz/R2.fastq.gz}"
    done
    for file in *_R1.fastq.gz; do
        cutadapt -e 0.06 -g CCTACGGGNGGCWGCAG -G GACTACHVGGGTATCTAATCC --pair-filter=any -o "../cutadapted_16S/${file}" --paired-output "../cutadapted_16S/${file/R1.fastq.gz/R2.fastq.gz}" --discard-untrimmed "$file" "${file/R1.fastq.gz/R2.fastq.gz}"
    done
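
The three loops differ only in the primer pair and the output directory, so they could be driven by one small table. The sketch below only prints the cutadapt commands (a dry run) so they can be inspected first. The primer sequences and cutadapt options are the ones used above; the function name `print_cutadapt_cmds` and the table layout are illustrative assumptions.

```shell
# Sketch: one loop over all three primer pairs instead of three copied loops.
# Prints the commands instead of running them (dry run).
print_cutadapt_cmds() {
    # $1 = forward-read file name, e.g. A10-1_R1.fastq.gz
    local r1="$1" r2 amplicon fwd rev
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
    while read -r amplicon fwd rev; do
        echo cutadapt -e 0.06 -g "$fwd" -G "$rev" --pair-filter=any \
            -o "../cutadapted_${amplicon}/${r1}" \
            --paired-output "../cutadapted_${amplicon}/${r2}" \
            --discard-untrimmed "$r1" "$r2"
    done <<'EOF'
yycH CGATGCKAAAGTGCCGAATA CTTCATTTAAGAAGCCACCWTGACT
g216 TGGGTATGRCAATCACTTTACA GCATCAAAAGCACTCTCATTACC
16S CCTACGGGNGGCWGCAG GACTACHVGGGTATCTAATCC
EOF
}

# Usage (inside raw_data): review the output, then pipe it to bash to actually run it.
# for file in *_R1.fastq.gz; do print_cutadapt_cmds "$file"; done
```
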

2.3. (IGNORED) regenerate filtered_R1 and filtered_R2 (under conda env qiime1 using pear) --> IGNORED, since we should use filterAndTrim from dada2 in the next step!

    #DEPRECATED: mkdir pandaseq_16S pandaseq_yycH pandaseq_g216
    #DEPRECATED: -p CCTACGGGNGGCWGCAG -q GACTACHVGGGTATCTAATCC 
    #DEPRECATED: for file in cutadapted_16S/*_R1.fastq.gz; do pandaseq -f ${file} -r ${file/_R1.fastq.gz/_R2.fastq.gz} -l 300  -w pandaseq_16S/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_merged.fasta >> LOG_pandaseq_16S; done

    #https://learnmetabarcoding.github.io/LearnMetabarcoding/processing/pair_merging.html#
    #conda install -c conda-forge -c bioconda -c defaults seqkit
    mkdir pear_yycH pear_g216 pear_16S
    for file in cutadapted_yycH/*_R1.fastq.gz; do pear -f ${file} -r ${file/_R1.fastq.gz/_R2.fastq.gz} -j 4 -q 26 -v 10 -o pear_yycH/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1) >> LOG_pear_yycH; done
    for file in cutadapted_g216/*_R1.fastq.gz; do pear -f ${file} -r ${file/_R1.fastq.gz/_R2.fastq.gz} -j 2 -q 26 -v 10 -o pear_g216/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1) >> LOG_pear_g216; done
    for file in cutadapted_16S/*_R1.fastq.gz; do pear -f ${file} -r ${file/_R1.fastq.gz/_R2.fastq.gz} -j 2 -q 26 -v 10 -o pear_16S/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1) >> LOG_pear_16S; done

    for file in cutadapted_yycH/*_R1.fastq.gz; do
    grep "@M0370" pear_yycH/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1).assembled.fastq > cutadapted_yycH/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs.txt
    sed -i -e 's/@//g' cutadapted_yycH/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs.txt
    cut -d' ' -f1 cutadapted_yycH/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs.txt > cutadapted_yycH/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs_.txt
    seqkit grep -f cutadapted_yycH/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs_.txt cutadapted_yycH/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_R1.fastq.gz -o cutadapted_yycH/filtered_R1/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_R1.fastq.gz
    seqkit grep -f cutadapted_yycH/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs_.txt cutadapted_yycH/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_R2.fastq.gz -o cutadapted_yycH/filtered_R2/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_R2.fastq.gz
    done
    #>>LOG_pear_yycH

    for file in cutadapted_g216/*_R1.fastq.gz; do
    grep "@M0370" pear_g216/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1).assembled.fastq > cutadapted_g216/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs.txt
    sed -i -e 's/@//g' cutadapted_g216/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs.txt
    cut -d' ' -f1 cutadapted_g216/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs.txt > cutadapted_g216/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs_.txt
    seqkit grep -f cutadapted_g216/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs_.txt cutadapted_g216/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_R1.fastq.gz -o cutadapted_g216/filtered_R1/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_R1.fastq.gz
    seqkit grep -f cutadapted_g216/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs_.txt cutadapted_g216/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_R2.fastq.gz -o cutadapted_g216/filtered_R2/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_R2.fastq.gz
    done

    for file in cutadapted_16S/*_R1.fastq.gz; do
    grep "@M0370" pear_16S/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1).assembled.fastq > cutadapted_16S/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs.txt
    sed -i -e 's/@//g' cutadapted_16S/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs.txt
    cut -d' ' -f1 cutadapted_16S/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs.txt > cutadapted_16S/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs_.txt
    seqkit grep -f cutadapted_16S/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs_.txt cutadapted_16S/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_R1.fastq.gz -o cutadapted_16S/filtered_R1/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_R1.fastq.gz
    seqkit grep -f cutadapted_16S/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_IDs_.txt cutadapted_16S/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_R2.fastq.gz -o cutadapted_16S/filtered_R2/$(echo $file | cut -d'/' -f2 | cut -d'_' -f1)_R2.fastq.gz
    done
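
In the loops above the sample name is recomputed with `echo | cut | cut` up to eight times per iteration, and the read IDs are collected with grep + sed + cut. Both steps can be written more compactly. The following is only a sketch: the helper names `sample_name` and `fastq_ids` are made up, and it assumes standard 4-line FASTQ records in the `.assembled.fastq` files.

```shell
# Sketch: compute the sample name once and pull the read IDs with awk.
sample_name() {
    # cutadapted_yycH/AH-LH_R1.fastq.gz -> AH-LH (text before the first "_")
    local base
    base=$(basename "$1")
    echo "${base%%_*}"
}

fastq_ids() {
    # Prints the ID of every record: the header is every 4th line starting at line 1;
    # the leading "@" and the description after the first space are dropped.
    awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }' "$@"
}

# Usage inside one of the loops above:
# s=$(sample_name "$file")
# mkdir -p cutadapted_yycH/filtered_R1 cutadapted_yycH/filtered_R2
# fastq_ids "pear_yycH/$s.assembled.fastq" > "cutadapted_yycH/${s}_IDs_.txt"
```

Selecting header lines by position avoids matching a quality line that happens to start with the same characters.
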

2.4. filtering+trimming+merging+chimera-removing (VERY_IMPORTANT: under conda env r4-base)

    #Input: cutadapted_yycH, cutadapted_g216
    #Outputs: yycH[g216|16S]_seqtab_from_dada2.rds
    #         yycH[g216|16S]_seqtab_from_dada2.csv
    #         yycH[g216|16S]_seqtab_nochim.rds
    #         yycH[g216|16S]_seqtab_nochim.csv
    #         yycH[g216|16S]_seqtab_image.RData
    #         track_yycH[g216|16S].csv -->
    #The following scripts are modified from epidome/scripts/dada2_for_EPIDOME_yycH_runwise_pipeline.R
    #./my_EPIDOME_g216_runwise_pipeline.R    #minLen=185, with FILTERING, it does not work!
    #./my_EPIDOME_yycH_runwise_pipeline.R    #minLen=200, with FILTERING, it does not work!
    ./my_EPIDOME_g216_runwise_pipeline_.R > g216_runwise_pipeline_.LOG    #minLen=185. NO FILTERING ANY MORE, since the inputs are filtered_R1 and filtered_R2.
    ./my_EPIDOME_yycH_runwise_pipeline_.R > yycH_runwise_pipeline_.LOG    #minLen=200. NO FILTERING ANY MORE, since the inputs are filtered_R1 and filtered_R2.
    ./my_EPIDOME_16S_runwise_pipeline.R > 16S_runwise_pipeline.LOG    #minLen=170. IGNORED since the 16S reads will be processed separately as below.
    #END

    #Overwriting file:/media/jhuang/Elements/Data_Luise_Epidome_batch3/cutadapted_yycH/filtered_R1/AH-LH_R1.fastq.gz
    #Overwriting file:/media/jhuang/Elements/Data_Luise_Epidome_batch3/cutadapted_yycH/filtered_R2/AH-LH_R2.fastq.gz
    #Read in 90261 paired-sequences, output 89026 (98.6%) filtered paired-sequences.
    #Overwriting file:/media/jhuang/Elements/Data_Luise_Epidome_batch3/cutadapted_yycH/filtered_R1/AH-NLH_R1.fastq.gz
    #Overwriting file:/media/jhuang/Elements/Data_Luise_Epidome_batch3/cutadapted_yycH/filtered_R2/AH-NLH_R2.fastq.gz
    #Read in 74638 paired-sequences, output 73633 (98.7%) filtered paired-sequences.
    #Overwriting file:/media/jhuang/Elements/Data_Luise_Epidome_batch3/cutadapted_yycH/filtered_R1/AH-Nose_R1.fastq.gz
    #Overwriting file:/media/jhuang/Elements/Data_Luise_Epidome_batch3/cutadapted_yycH/filtered_R2/AH-Nose_R2.fastq.gz
    #Read in 71311 paired-sequences, output 70542 (98.9%) filtered paired-sequences.
    #Overwriting file:/media/jhuang/Elements/Data_Luise_Epidome_batch3/cutadapted_yycH/filtered_R1/AL-LH_R1.fastq.gz
    #Overwriting file:/media/jhuang/Elements/Data_Luise_Epidome_batch3/cutadapted_yycH/filtered_R2/AL-LH_R2.fastq.gz
    #Read in 94807 paired-sequences, output 93082 (98.2%) filtered paired-sequences.

    ~/Tools/csv2xls-0.4/csv_to_xls.py g216_track.csv yycH_track.csv 16S_track.csv -d$';' -o overview.xls;

    # Read the CSV file into a data frame
    df <- read.csv("g216_seqtab_from_dada2_nohead.csv", sep=";", row.names=1, header=FALSE)
    #df <- read.csv("g216_seqtab_nochim.csv", sep=";", row.names=1)
    # Calculate the sum for each row
    row_sums <- rowSums(df)
    # Print the row sums
    print(row_sums)

    > print(row_sums)
      AH-LH   AH-NLH  AH-Nose    AL-LH   AL-NLH  AL-Nose    CB-LH   CB-NLH 
      87232    69741    89689    76660    94636   108810    73814    56312 
    CB-Nose    HR-LH   HR-NLH  HR-Nose    KK-LH   KK-NLH  KK-Nose    MC-LH 
      61740    76216    63165    55479    78550    87579    83826    73738 
      MC-NLH  MC-Nose    MR-LH   MR-NLH  MR-Nose P10-Nose P11-Nose P12-Nose 
      100338    94956    88158    63054    82361   103782   108533    90398 
    P13-Nose P14-Nose P15-Nose P16-Foot P16-Nose P17-Foot P17-Nose P18-Foot 
      87059    95656   110207    67058    77606    58339    95407    87775 
    P18-Nose P19-Foot P19-Nose P20-Foot P20-Nose  P7-Nose  P8-Nose  P9-Nose 
      107560    79373    99571   104667   109457   101528    99565   147485 
      PT2-LH  PT2-NLH PT2-Nose    RP-LH   RP-NLH  RP-Nose    SA-LH   SA-NLH 
      81074    89345    77121    68946    71414   113722    53465    40966 
    SA-Nose    XN-LH   XN-NLH  XN-Nose 
      33381    73842    76028    80630

      AH-LH   AH-NLH  AH-Nose    AL-LH   AL-NLH  AL-Nose    CB-LH   CB-NLH 
      79615    52143    57780    70801    89636    99428    52185    54848 
    CB-Nose    HR-LH   HR-NLH  HR-Nose    KK-LH   KK-NLH  KK-Nose    MC-LH 
      50249    58365    45201    38747    56027    65755    53110    72355 
      MC-NLH  MC-Nose    MR-LH   MR-NLH  MR-Nose P10-Nose P11-Nose P12-Nose 
      97363    69881    68599    52386    48364    84634    78491    71877 
    P13-Nose P14-Nose P15-Nose P16-Foot P16-Nose P17-Foot P17-Nose P18-Foot 
      81694    84290   100606    62621    66484    50015    93498    73730 
    P18-Nose P19-Foot P19-Nose P20-Foot P20-Nose  P7-Nose  P8-Nose  P9-Nose 
      93713    63755    70420    81030   105601    94352    70765   124449 
      PT2-LH  PT2-NLH PT2-Nose    RP-LH   RP-NLH  RP-Nose    SA-LH   SA-NLH 
      56869    74951    62339    68726    71145   111279    47754    36180 
    SA-Nose    XN-LH   XN-NLH  XN-Nose 
      26130    55647    69993    50727

    #In cutadapted_yycH/filtered_R1: Extraction-control-2_R1.fastq.gz has 2696 lines (checked with vim), i.e. 2696/4 = 674 reads.
    #Read in 1138 paired-sequences, output 674 (59.2%) filtered paired-sequences.
    #"Extraction-control-2";0;61;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;591;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0 -->600 sequences
    #"Extraction-control-2";0;61;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;591;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0  #-->after chimera-removing, only 107 sequences
    #Processing: Extraction-control-2
    #Sample 1 - 674 reads in 209 unique sequences.  #"unique sequences" = distinct read sequences after dereplication (identical reads are collapsed); this is not the number of merged sequences
    #Sample 1 - 674 reads in 280 unique sequences.
    #"";"input_read_count"*; "filtered_and_trimmed_read_count";"merged_after_dada2_read_count"; "non-chimeric_read_count"*
    #"Extraction-control-2"; 1138*; 674;652; 652*

    #biom convert -i table.txt -o table.from_txt_json.biom --table-type="OTU table" --to-json
    #summarize_taxa_through_plots.py -i clustering/otu_table_mc2_w_tax_no_pynast_failures.biom -o plots/taxa_summary -s
    #summarize_taxa.py -i otu_table.biom -o ./tax

    #g216_track.csv
    "A10-1";36009;493;367;367
    "A10-2";175221;82727;30264;27890
    "A10-3";110170;36812;13715;13065
    "A10-4";142323;64398;24306;21628

    #yycH_track.csv
    "A10-1";37158;549;0;0     pandaseq:36791
    "A10-2";82145;23953;6180;5956
    "A10-3";53438;12480;2944;2944
    "A10-4";64516;18361;12350;11797

    #16S_track.csv
    "A10-1";46149(in cutadapted_16S);13(in filtered_R1);8;8
    "A10-2";197875;2218;1540;1391
    "A10-3";230646;2429;1819;1752
    "A10-4";175759;2001;1439;1366

    #zcat A10-1_R1.fastq.gz | echo $((`wc -l`/4))
    #121007 > 37158 + 36009 + 46149 = 119316
    #458392 > 82145 + 175221 + 197875 = 455241
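
The two comparisons above say that a few raw read pairs carry none of the three primer pairs. The counting and the subtraction can be wrapped as follows. This is a small sketch: the function names `count_reads` and `unassigned_reads` are invented here, and the numbers are the ones from the track files above.

```shell
# Sketch: count reads in a gzipped FASTQ and compare with the per-amplicon totals.
count_reads() {
    # 4 lines per record; gzip -cd is used instead of zcat for portability.
    echo $(( $(gzip -cd "$1" | wc -l) / 4 ))
}

unassigned_reads() {
    # $1 = raw read pairs, remaining args = read pairs assigned to each amplicon
    local total="$1" assigned=0 n
    shift
    for n in "$@"; do assigned=$((assigned + n)); done
    echo $((total - assigned))
}

unassigned_reads 121007 37158 36009 46149      # A10-1: raw vs. yycH + g216 + 16S -> 1691
unassigned_reads 458392 82145 175221 197875    # A10-2 -> 3151
```
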

2.5. (IGNORED) stitching and removing chimeras (see ~/DATA/Data_Holger_Epidome/epidome/scripts/Combine_and_Remove_Chimeras_yycH.R)

    #my_Combine_and_Remove_Chimeras_g216.R is a part of my_EPIDOME_yycH_runwise_pipeline.R (see lines 53-55) --> IGNORED!

2.6. Classification: epidome/scripts/ASV_blast_classification.py

    #Input: g216_seqtab_nochim.csv using DATABASE epidome/DB/g216_ref_aln.fasta
    #Output: g216_seqtab_ASV_seqs.fasta, g216_seqtab_ASV_blast.txt and g216_seqtab.csv.classified.csv

    python3 epidome/scripts/ASV_blast_classification.py yycH_seqtab_nochim.csv yycH_seqtab_ASV_seqs.fasta  epidome/DB/yycH_ref_aln.fasta  yycH_seqtab_ASV_blast.txt yycH_seqtab.csv.classified.csv 99.5
    python3 epidome/scripts/ASV_blast_classification.py g216_seqtab_nochim.csv g216_seqtab_ASV_seqs.fasta  epidome/DB/g216_ref_aln.fasta  g216_seqtab_ASV_blast.txt g216_seqtab.csv.classified.csv 99.5

    #old: python3 epidome/scripts/ASV_blast_classification.py   yycH_seqtab.csv yycH_seqtab.csv.ASV_seqs.fasta  epidome/DB/yycH_ref_aln.fasta yycH_seqtab.csv.ASV_blast.txt yycH_seqtab.csv.classified.csv 99.5
    #old: python3 epidome/scripts/ASV_blast_classification_combined.py -p1 190920_run1_yycH_seqtab_from_dada2.csv -p2 190920_run1_G216_seqtab_from_dada2.csv -p1_ref epidome/DB/yycH_ref_aln.fasta -p2_ref epidome/DB/g216_ref_aln.fasta

    ##rename "seqseq2" --> seq2
    #sed -i -e s/seq//g 190920_run1_yycH_seqtab_from_dada2.csv.ASV_blast.txt
    #sed -i -e s/seqseq/seq/g 190920_run1_yycH_seqtab_from_dada2.csv.classified.csv
    #diff 190920_run1_yycH_seqtab_from_dada2.csv.ASV_seqs.fasta epidome/example_data/190920_run1_yycH_seqtab_from_dada2.csv.ASV_seqs.fasta
    #diff 190920_run1_yycH_seqtab_from_dada2.csv.ASV_blast.txt epidome/example_data/190920_run1_yycH_seqtab_from_dada2.csv.ASV_blast.txt
    #diff 190920_run1_yycH_seqtab_from_dada2.csv.classified.csv epidome/example_data/190920_run1_yycH_seqtab_from_dada2.csv.classified.csv
    ## WHY: 667 seqs in old calculation, but in our calculation only 108 seqs
    ## They took *_seqtab_from_dada2.csv, but we took *_seqtab_nochim.csv. (653 vs 108 records!)
    ##AAAT";"seq37,36";0;
    sed -i -e s/seq//g yycH_seqtab_ASV_blast.txt   #length=476
    sed -i -e s/seq//g g216_seqtab_ASV_blast.txt   #length=448
    #;-->""
    sed -i -e s/';'//g yycH_seqtab_ASV_blast.txt
    sed -i -e s/';'//g g216_seqtab_ASV_blast.txt
    sed -i -e s/seqseq/seq/g yycH_seqtab.csv.classified.csv
    sed -i -e s/seqseq/seq/g g216_seqtab.csv.classified.csv
    #;,seq --> ,seq
    #;"; --> ";
    sed -i -e s/";,seq"/",seq"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/";,seq"/",seq"/g g216_seqtab.csv.classified.csv
    sed -i -e s/";\";"/"\";"/g yycH_seqtab.csv.classified.csv
    sed -i -e s/";\";"/"\";"/g g216_seqtab.csv.classified.csv

    #"ASV";"Seq_number";"even-mock3-1_S258_L001";"even-mock3-2_S282_L001";"even-mock3-3_S199_L001";"staggered-mock3-1_S270_L001";"staggered-mock3-2_S211_L001";"staggered-mock3-3_S223_L001"
    #"ASV";"Seq_number";"Extraction_control_1";"Extraction_control_2";"P01_nose_1";"P01_nose_2";"P01_skin_1";"P01_skin_2";"P02_nose_1";"P02_nose_2";"P02_skin_1";"P02_skin_2";"P03_nose_1";"P03_nose_2";"P03_skin_1";"P03_skin_2";"P04_nose_1";"P04_nose_2";"P04_skin_1";"P04_skin_2";"P05_nose_1";"P05_nose_2";"P05_skin_1";"P05_skin_2";"P06_nose_1";"P06_nose_2";"P06_skin_1";"P06_skin_2";"P07_nose_1";"P07_nose_2";"P07_skin_1";"P07_skin_2";"P08_nose_1";"P08_nose_2";"P08_skin_1";"P08_skin_2";"P09_nose_1";"P09_nose_2";"P09_skin_1";"P09_skin_2";"P10_nose_1";"P10_nose_2";"P10_skin_1";"P10_skin_2";"P11_nose_1";"P11_nose_2";"P11_skin_1";"P11_skin_2";"even-mock3-1_S258_L001";"even-mock3-2_S282_L001";"even-mock3-3_S199_L001";"staggered-mock3-1_S270_L001";"staggered-mock3-2_S211_L001";"staggered-mock3-3_S223_L001"

    grep -v ";NA;" g216_seqtab.csv.classified.csv > g216_seqtab.csv.classified_noNA.csv
    grep -v ";NA;" yycH_seqtab.csv.classified.csv > yycH_seqtab.csv.classified_noNA.csv

    https://github.com/ssi-dk/epidome/blob/master/example_data/190920_run1_G216_seqtab_from_dada2.csv.classified.csv
    #DEBUG using LibreOffice, e.g. libreoffice --calc yycH_seqtab.csv.classified_noNA.csv after adding "ID"; at the corner.
    seq24,seq21 --> seq24,21

    #(OPTIONAL) TO reduce the unclassified, rename seq31,30 --> seq in g216,  seq37,36 --> seq in yycH.
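
The three substitutions applied to the `*.classified.csv` files can be tried out as a single filter on stdin before touching the files with `sed -i`. Note the assumptions: the function name `clean_classified` is made up, and the input line below is invented purely to show what each substitution does; the real files may look different.

```shell
# Sketch: the three clean-up substitutions for *.classified.csv as one filter (no -i),
# applied in the same order as above.
clean_classified() {
    sed -e 's/seqseq/seq/g' \
        -e 's/;,seq/,seq/g' \
        -e 's/;";/";/g' "$@"
}

# Invented input line (not taken from a real file):
printf '%s\n' '"AAAT";"seqseq37;,seqseq36;";0;12' | clean_classified
# prints: "AAAT";"seq37,seq36";0;12
```

Running the filter without `-i` first makes it easy to diff the result against the original before overwriting anything.
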

2.7. draw plot from three amplicons: cutadapted_g216, cutadapted_yycH, and cutadapted_16S

    #Taxonomic database setup and classification
    #- Custom databases of all unique g216 and yycH target sequences can be found at https://github.com/ssi-dk/epidome/tree/master/DB. 
    #- We formatted our g216 and yycH gene databases to be compatible with DADA2’s assignTaxonomy function and used it to classify the S. epidermidis ASVs with the RDP naive Bayesian classifier method (https://github.com/ssi-dk/epidome/tree/master/scripts).
    #- ST classification of samples was performed using the g216 target sequence as the primary identifier. 
    #- All g216 sequences unique to a single clonal cluster in the database were immediately classified as the matching clone, and in cases where the g216 sequence matched multiple clones, the secondary yycH target sequences were parsed to determine which clone was present. When this classification failed to resolve due to multiple potential combinations of sequences, ASVs were categorized as “Unclassified”. Similarly, g216 sequences not found in the database were labelled as “Novel”. 

    #~/Tools/csv2xls-0.4/csv_to_xls.py g216_seqtab.csv.classified_noNA.csv yycH_seqtab.csv.classified_noNA.csv -d$'\t' -o counts.xls

    #under r4-base
    source("epidome/scripts/epidome_functions.R")

    ST_amplicon_table = read.table("epidome/DB/epidome_ST_amplicon_frequencies.txt",sep = "\t")
    #"ST"    "Group" "epi01_ASV"     "epi02_ASV"     "freq"
    #"8888"  "324"   113     1       1       2
    #"815097"        "5"     48      40      2       55
    #"846906"        "225"   61      3       3       1
    #"846960"        "-"     62      3       3       1
    #"847064"        "225"   63      3       3       2
    #"847222"        "225"   65      3       3       3
    #"865555"        "278"   37      5       3       4

    epi01_table = read.table("g216_seqtab.csv.classified_noNA.csv",sep = "\t",header=TRUE,row.names=1)
    epi02_table = read.table("yycH_seqtab.csv.classified_noNA.csv",sep = "\t",header=TRUE,row.names=1)
    #> sum(epi01_table$AH.LH)
    #[1] 78872
    #> sum(epi02_table$AH.LH)
    #[1] 86949

    #construct metadata.txt as follows.
    #"sample.ID"     "patient.ID"    "sample.site"   "sample.type"   "patient.sample.site"
    #P7.Nose P7      Nose    swab    P7.Nose.swab
    #P8.Nose P8      Nose    swab    P8.Nose.swab
    #P9.Nose P9      Nose    swab    P9.Nose.swab
    #P10.Nose        P10     Nose    swab    P10.Nose.swab
    #P11.Nose        P11     Nose    swab    P11.Nose.swab
    #P12.Nose        P12     Nose    swab    P12.Nose.swab
    #P13.Nose        P13     Nose    swab    P13.Nose.swab

    metadata_table = read.table("metadata.txt",header=TRUE,row.names=1)
    metadata_table$patient.ID <- factor(metadata_table$patient.ID, levels=c("P7","P8","P9","P10","P11","P12","P13","P14","P15","P16","P17","P18","P19","P20", "AH","AL","CB","HR","KK",  "MC","MR","PT2","RP","SA","XN"))
    epidome_object = setup_epidome_object(epi01_table,epi02_table,metadata_table = metadata_table)

    #Image1
    primer_compare = compare_primer_output(epidome_object,color_variable = "sample.type")
    png("image1.png")
    primer_compare$plot
    dev.off()

    eo_ASV_combined = combine_ASVs_epidome(epidome_object)
    eo_filtered = filter_lowcount_samples_epidome(eo_ASV_combined,500,500)

    count_table = classify_epidome(eo_ASV_combined,ST_amplicon_table)
    #count_df_ordered = count_table[order(rowSums(count_table),decreasing = T),]

    #install.packages("pls")
    #library(pls)
    #install.packages("reshape")
    #library(reshape)
    #install.packages("vegan")
    library('vegan') 
    library(scales)
    library(RColorBrewer)

    #Image2
    #TODO: find out what the combination 21006 in Aachen is.

    source("epidome/scripts/epidome_functions.R")
    #count_table = count_table[-29,]
    #row.names(count_table) <- c("-", "ST297", "ST170", "ST73", "ST225", "ST673", "ST215", "ST19", "Unclassified")
    #row.names(count_table) <- c("NA", "-", "X297", "X170", "X73", "X225", "X673", "X215", "X19", "Unclassified")
    row.names(count_table) <- c("ST215","ST130","ST278","ST200","ST5","ST59","ST83","ST114","ST297","ST384","ST14","ST89","ST210","-","ST328","ST331","ST73","ST2","ST88","ST100","ST10","ST290","ST87","ST23","ST218","ST329","ST19","ST225","ST170","Unclassified")
    #colnames(count_table) <- c("A2.1","A2.2","A2.3","A3.1","A3.2","A3.3","A4.1","A4.2","A4.3","A4.4","A5.1","A5.2","A5.3","A5.4","A5.5","A5.6","A5.7","A10.1","A10.2","A10.3","A10.4","A17.1","A17.2","A17.3","A21.1","A21.2","A21.3","A22.1","A22.2","A22.3","A24.1","A24.2","A24.3","A25.1","A25.2","A25.3","A27.1","A27.2","A27.3","A28.1","A28.2","A28.3","LM.Nose","LM.Foot","LZ.Foot","LZ.Nose","NG.Foot","NG.Nose","VK.Foot","VK.Nose","AK.Foot","AK.Nose","MS.Foot","MS.Nose","AH.Nose","AH.Foot","AY_Nose","AY.Foot","JS.Nose","JS.Foot","PC.Nose","PC.Foot","SB.Nose","SB.Foot")
    #col_order <- c("A2.1","A2.2","A2.3","A3.1","A3.2","A3.3","A4.1","A4.2","A4.3","A4.4","A5.1","A5.2","A5.3","A5.4","A5.5","A5.6","A5.7","A10.1","A10.2","A10.3","A10.4","A17.1","A17.2","A17.3","A21.1","A21.2","A21.3","A22.1","A22.2","A22.3","A24.1","A24.2","A24.3","A25.1","A25.2","A25.3","A27.1","A27.2","A27.3","A28.1","A28.2","A28.3", "LM.Nose","LM.Foot", "LZ.Nose","LZ.Foot", "NG.Nose","NG.Foot", "VK.Nose","VK.Foot", "AK.Nose","AK.Foot", "MS.Nose","MS.Foot", "AH.Nose","AH.Foot", "AY_Nose","AY.Foot", "JS.Nose","JS.Foot", "PC.Nose","PC.Foot", "SB.Nose","SB.Foot")
    #count_table_reordered <- count_table[,col_order]
    count_table_reordered <- count_table

    write.csv(file="count_table.txt", count_table_reordered)
    #NOTE to change rowname from '-' to 'Novel'

    p = make_barplot_epidome(count_table_reordered,reorder=FALSE,normalize=TRUE)
    #p = make_barplot_epidome(count_table_reordered,reorder=TRUE,normalize=TRUE)
    png("Barplot_All.png", width=1600, height=900)
    print(p)    #explicit print() so the plot is also drawn when the script is sourced
    dev.off()

    #Image3
    eo_clinical = prune_by_variable_epidome(epidome_object,"sample.type",c("swab"))
    #eo_Aachen = prune_by_variable_epidome(epidome_object,"sample.type",c("Aachen"))

    epidome_object_clinical_norm = normalize_epidome_object(eo_clinical) ### Normalize counts to percent
    #epidome_object_Aachen_norm = normalize_epidome_object(eo_Aachen)

    png("PCA_by_patientID.png", width=1200, height=800)
    PCA_patient_colored = plot_PCA_epidome(eo_filtered,color_variable = "patient.ID",colors=c(), plot_ellipse = FALSE)
    print(PCA_patient_colored + ggtitle("PCA plot of all samples colored by patient ID"))
    dev.off()
    png("PCA_Clinical_by_patientID.png", width=1200, height=800)
    PCA_patient_colored = plot_PCA_epidome(epidome_object_clinical_norm,color_variable = "patient.ID",colors=c(), plot_ellipse = FALSE)
    print(PCA_patient_colored + ggtitle("PCA plot of clinical samples colored by patient ID"))
    dev.off()
    #png("PCA_Aachen_by_patientID.png", width=1200, height=800)
    #PCA_sample_site_colored = plot_PCA_epidome(epidome_object_Aachen_norm,color_variable = "patient.ID",colors=c(), plot_ellipse = FALSE)
    #PCA_sample_site_colored + ggtitle("PCA plot of nose and foot samples colored by patient ID")
    #dev.off()
    #png("PCA_Aachen_by_sampleSite.png", width=1200, height=800)
    #PCA_sample_site_colored = plot_PCA_epidome(epidome_object_Aachen_norm,color_variable = "sample.site",colors = c("Red","Blue"),plot_ellipse = FALSE)
    #PCA_sample_site_colored + ggtitle("PCA plot of nose and foot samples colored by sampling site")
    #dev.off()

    #Image4
    eo_filter_lowcount = filter_lowcount_samples_epidome(epidome_object,p1_threshold = 500,p2_threshold = 500)
    #-->[1] "1 low count samples removed from data: RP.Nose"
    eo_filter_ASVs = epidome_filtered_ASVs = filter_lowcount_ASVs_epidome(epidome_object,percent_threshold = 1)
    epidome_object_normalized = normalize_epidome_object(epidome_object)
    epidome_object_ASV_combined = combine_ASVs_epidome(epidome_object)
    epidome_object_clinical = prune_by_variable_epidome(epidome_object,variable_name = "sample.type",variable_values = c("swab"))
    #epidome_object_Aachen= prune_by_variable_epidome(epidome_object,variable_name = "sample.type",variable_values = c("Aachen"))

    eo_ASV_combined = combine_ASVs_epidome(epidome_object_clinical)
    count_table_reordered = classify_epidome(eo_ASV_combined,ST_amplicon_table)
    p = make_barplot_epidome(count_table_reordered,reorder=FALSE,normalize=TRUE)
    png("Barplot_Clinical.png", width=1200, height=800)
    print(p)    #explicit print() so the plot is also drawn when the script is sourced
    dev.off()

    #eo_ASV_combined = combine_ASVs_epidome(epidome_object_Aachen)
    #count_table_reordered = classify_epidome(eo_ASV_combined,ST_amplicon_table)
    #colnames(count_table_reordered) <- c("LM.Nose","LM.Foot","LZ.Nose","LZ.Foot","NG.Nose","NG.Foot","VK.Nose","VK.Foot","AK.Nose","AK.Foot","MS.Nose","MS.Foot","AH.Nose","AH.Foot","AY_Nose","AY.Foot","JS.Nose","JS.Foot","PC.Nose","PC.Foot","SB.Nose","SB.Foot")
    #p = make_barplot_epidome(count_table_reordered,reorder=FALSE,normalize=TRUE)
    #png("Barplot_Aachen.png", width=1200, height=800)
    #p
    #dev.off()

3.0. The methods for 16S

3.1. stitch

    mkdir pandaseq.out
    #-p CCTACGGGNGGCWGCAG -q GACTACHVGGGTATCTAATCC
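    #NOTE: ${file//_R1/_R2} replaces every _R1 (directory and file name); ${file/_R1/_R2} would only change the directory.
    for file in cutadapted_16S/filtered_R1/*_R1.fastq.gz; do echo "pandaseq -f ${file} -r ${file//_R1/_R2} -l 300 -w pandaseq.out/$(echo "$file" | cut -d'/' -f3 | cut -d'_' -f1)_merged.fasta >> LOG_pandaseq"; done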

    pandaseq -f cutadapted_16S/filtered_R1/AH-LH_R1.fastq.gz -r cutadapted_16S/filtered_R2/AH-LH_R2.fastq.gz -l 300 -w pandaseq.out/AH-LH_merged.fasta >> LOG_pandaseq
    ...

    grep ">" AH-LH_merged.fasta | wc -l
    ...

    jhuang@hamburg:~/DATA/Data_Luise_Epidome_batch3/core_diversity_e33778$ grep "AH.LH" biom_table_summary.txt
    AH.LH: 75.566,000
    ...
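The R2 path and the sample name in this step are both derived from the R1 path. The following is a minimal sketch of that derivation, assuming the `cutadapted_16S/filtered_R1/<sample>_R1.fastq.gz` layout shown above; it uses `sed` so that it also runs in a plain POSIX shell, and the variable names are only illustrative.

```shell
#!/bin/sh
# Sketch: derive the R2 path and the sample name from an R1 path.
# Assumes the layout used above: cutadapted_16S/filtered_R1/<sample>_R1.fastq.gz

r1="cutadapted_16S/filtered_R1/AH-LH_R1.fastq.gz"

# Replacing only the first "_R1" changes the directory but not the file name
# (Bash equivalent: ${r1/_R1/_R2}).
first_only=$(printf '%s\n' "$r1" | sed 's/_R1/_R2/')

# Replacing every "_R1" changes both (Bash equivalent: ${r1//_R1/_R2}).
r2=$(printf '%s\n' "$r1" | sed 's/_R1/_R2/g')

# Sample name: strip the directory, then the _R1.fastq.gz suffix.
base="${r1##*/}"
sample="${base%_R1.fastq.gz}"

echo "first match only: $first_only"
echo "all matches:      $r2"
echo "sample:           $sample"
```

Only the "all matches" form points at a file under `filtered_R2/` whose name also ends in `_R2.fastq.gz`.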

3.2. create two QIIME mapping files

    validate_mapping_file.py -m map2.txt

3.3. combine files into a labeled file

    add_qiime_labels.py -i pandaseq.out -m map2_corrected.txt -c FileInput -o combined_fasta

3.4. remove chimeric sequences using usearch

    cd combined_fasta
    pyfasta split -n 100 combined_seqs.fna
    for i in {000..099}; do echo "identify_chimeric_seqs.py -i combined_fasta/combined_seqs.fna.${i} -m usearch61 -o usearch_checked_combined.${i}/ -r ~/REFs/gg_97_otus_4feb2011_fw_rc.fasta --threads=14;" >> uchime_commands.sh; done
    mv uchime_commands.sh ..
    cd ..    #the paths in uchime_commands.sh are relative to the parent directory
    bash uchime_commands.sh

    cat usearch_checked_combined.{000..099}/chimeras.txt > chimeras.txt
    filter_fasta.py -f combined_fasta/combined_seqs.fna -o combined_fasta/combined_nonchimera_seqs.fna -s chimeras.txt -n;
    rm -rf usearch_checked_combined.0*

    grep ">AH.LH_" combined_nonchimera_seqs.fna | wc -l
    ...
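If one of the 100 chimera-check jobs fails, its `chimeras.txt` is missing and a plain concatenation either stops or silently loses that chunk. The following is a minimal sketch of a merge that reports missing chunks; the directory naming follows the step above, while the function name `merge_chimeras` and its arguments are hypothetical.

```shell
#!/bin/sh
# Sketch: concatenate per-chunk chimeras.txt files and report missing chunks.
# merge_chimeras is an illustrative helper, not part of QIIME or usearch.

merge_chimeras() {
    prefix="$1"; n_chunks="$2"; out="$3"
    missing=0
    i=0
    : > "$out"                                    # start with an empty output file
    while [ "$i" -lt "$n_chunks" ]; do
        dir=$(printf '%s.%03d' "$prefix" "$i")    # e.g. usearch_checked_combined.007
        if [ -f "$dir/chimeras.txt" ]; then
            cat "$dir/chimeras.txt" >> "$out"
        else
            echo "missing: $dir/chimeras.txt" >&2
            missing=$((missing + 1))
        fi
        i=$((i + 1))
    done
    [ "$missing" -eq 0 ]                          # non-zero exit status if any chunk is missing
}

# Usage (from the directory that holds the usearch_checked_combined.* folders):
# merge_chimeras usearch_checked_combined 100 chimeras.txt
```

Checking the exit status before running `filter_fasta.py` avoids filtering with an incomplete chimera list.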

3.5. create OTU picking parameter file, and run the QIIME open reference picking pipeline

    echo "pick_otus:similarity 0.97" > clustering_params.txt
    echo "assign_taxonomy:similarity 0.97" >> clustering_params.txt
    echo "parallel_align_seqs_pynast:template_fp /home/jhuang/REFs/SILVA_132_QIIME_release/core_alignment/80_core_alignment.fna" >> clustering_params.txt
    echo "assign_taxonomy:reference_seqs_fp /home/jhuang/REFs/SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99/silva_132_99_16S.fna" >> clustering_params.txt
    echo "assign_taxonomy:id_to_taxonomy_fp /home/jhuang/REFs/SILVA_132_QIIME_release/taxonomy/16S_only/99/consensus_taxonomy_7_levels.txt" >> clustering_params.txt
    echo "alpha_diversity:metrics chao1,observed_otus,shannon,PD_whole_tree" >> clustering_params.txt
    #with usearch61_ref for reference picking and usearch61 for de novo OTU picking
    pick_open_reference_otus.py -r /home/jhuang/REFs/SILVA_132_QIIME_release/rep_set/rep_set_16S_only/99/silva_132_99_16S.fna -i combined_fasta/combined_nonchimera_seqs.fna -o clustering/ -p clustering_params.txt --parallel
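The parameter file points at several reference files, and a typo in one of those paths only surfaces after the pipeline has already been running for a while. The following is a minimal sketch of a pre-flight check, assuming the `script:option value` layout written above and that options holding file paths end in `_fp`; the function name `check_param_paths` is hypothetical.

```shell
#!/bin/sh
# Sketch: list any file path in a QIIME parameter file that does not exist.
# Assumes "script:option value" lines and that path options end in "_fp".
# check_param_paths is an illustrative helper, not part of QIIME.

check_param_paths() {
    params="$1"
    status=0
    # Each line splits into a key ("script:option") and a value.
    while read -r key value; do
        case "$key" in
            *_fp)
                if [ ! -e "$value" ]; then
                    echo "missing: $key -> $value" >&2
                    status=1
                fi
                ;;
        esac
    done < "$params"
    return "$status"
}

# Usage:
# check_param_paths clustering_params.txt && echo "all reference files found"
```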

3.6. (optional), for control data

    summarize_taxa_through_plots.py -i clustering/otu_table_mc2_w_tax_no_pynast_failures.biom -o plots/taxa_summary -s
    mv usearch_checked_combined usearch_checked_combined_ctrl
    mv combined_fasta combined_fasta_ctrl
    mv clustering clustering_ctrl
    mv plots plots_ctrl

3.7. for other data: core diversity analyses

    core_diversity_analyses.py -o ./core_diversity_e33778 -i ./clustering/otu_table_mc2_w_tax_no_pynast_failures.biom -m ./map2_corrected.txt -t ./clustering/rep_set.tre -e 33778 -p ./clustering_params.txt

4.0. using R-code to summarize all results.

    rmarkdown::render('Phyloseq.Rmd',output_file='Phyloseq.html')

Phyloseq.Rmd

Phyloseq.html

RNAseq running with umi_tools

  1. install conda environment

    #conda config --set auto_activate_base false
    
    conda create --name rnaseq python=3.7
    
    #NOTE: mamba is indeed much faster; use mamba from now on!
    #install packages
    conda activate rnaseq
    pip3 install deeptools
    pip3 install multiqc
    conda install -c bioconda stringtie subread gffread
    conda install -c conda-forge -c bioconda -c defaults -c r r-data.table r-gplots
    conda install -c conda-forge -c bioconda -c defaults -c r bioconductor-dupradar bioconductor-edger
    conda install nextflow=23.04
    
    conda install fq
    conda install -c bioconda umi_tools
    conda install -c bioconda rsem
    conda install -c bioconda salmon
    
    #conda install some tools
    #install R-packages, 
    conda install -c bioconda ucsc-bedclip
    conda install -c bioconda ucsc-bedgraphtobigwig
    conda install -c bioconda bioconductor-matrixgenerics
    #conda install -c bioconda bioconductor-deseq2
    conda install -c bioconda r-pheatmap
    conda install -c anaconda gawk
    
    conda install mamba -n base -c conda-forge
    conda config --add channels conda-forge
    mamba install -c bioconda salmon=1.10
    #salmon should be >= 1.10, since those versions enable `--validateMappings` by default.
    
    conda install -c bioconda trim-galore star=2.6.1d bioconductor-summarizedexperiment bioconductor-tximport bioconductor-tximeta bioconductor-deseq2
    mamba install -c bioconda samtools=1.9  
    mamba install -c conda-forge r-optparse r-vctrs=0.5.0
    conda install nextflow=23.04
    mamba install -c bioconda qualimap
    mamba install -c bioconda rseqc
    mamba install -c conda-forge openssl
    conda install -c bioconda ucsc-bedclip
    conda install -c bioconda bedtools
    conda update -c bioconda ucsc-bedclip
    #for DEBUG: bedClip: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
    conda update -c bioconda ucsc-bedgraphtobigwig
    # samtools should be >= 1.9, as only those versions have the -@ option
    #samtools sort \
    #      -@ 6 \
    #      -o HSV.d2_r1.sorted.bam \
    #      -T HSV.d2_r1.sorted \
    #      HSV.d2_r1.Aligned.out.bam
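The notes above pin minimum versions (salmon >= 1.10, samtools >= 1.9). Comparing such versions as plain strings goes wrong, because "1.9" sorts after "1.10" alphabetically. The following is a minimal sketch of a version comparison, assuming GNU `sort -V` is available; `version_ge` is a hypothetical helper and not part of conda or any of the tools above.

```shell
#!/bin/sh
# Sketch: compare dotted version strings with GNU sort -V.
# version_ge is an illustrative helper, not part of conda or the tools above.

version_ge() {
    # True when $1 >= $2: after a version sort, the smaller entry must be $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

version_ge 1.10.2 1.10 && echo "1.10.2 satisfies >= 1.10"
version_ge 1.9 1.10    || echo "1.9 is older than 1.10"

# Possible use, assuming the tool prints "<name> <version>" on one line:
# version_ge "$(salmon --version | awk '{print $2}')" 1.10 || echo "salmon too old"
```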
  2. run UMItools without --umitools_dedup_stats, otherwise the run cannot finish on hamm.

    • Optimize UMItools parameters: Some parameters might influence the memory usage of UMItools. For example, you can try to reduce the number of allowed mismatches in the UMI sequence (--edit-distance-threshold). This will make the deduplication process less memory intensive but might also impact the results.

    • Use other deduplication tools: If the problem persists, you might need to use alternative tools for UMI deduplication which are less memory-intensive. Tools such as fgbio have a grouping and deduplication method similar to UMItools but have been reported to require less memory.

      #https://github.com/nf-core/rnaseq/issues/827
      #INFO for DEBUG: https://umi-tools.readthedocs.io/en/latest/faq.html
      #INFO for DEBUG: https://readthedocs.org/projects/umi-tools/downloads/pdf/stable/
      #https://github.com/CGATOxford/UMI-tools/issues/173
      # excessive dedup memory usage with output-stats #409 
      #https://github.com/CGATOxford/UMI-tools/issues/409
      #umi_tools 1.0.1
      #I am aware of previously closed issues:
      #excessive dedup memory usage #173
      #speed up stats #184
      #Running a single-end bam file with 3.13M reads and a 10bp (fully random) UMI.
      #Using --method=unique
      #There still seems to be a memory problem with --output-stats
      #Running with output-stats, memory usage climbs over 100GB and eventually crashes with "MemoryError".
      #Running without output-stats, job completes in about 3 minutes, with no problems.
      
      #TRY STANDALONE RUNNING: /usr/local/bin/python /usr/local/bin/umi_tools dedup -I HSV.d8_r1.transcriptome.sorted.bam -S HSV.d8_r1.umi_dedup.transcriptome.sorted.bam --method=unique --random-seed=100 
      #/home/jhuang/miniconda3/envs/rnaseq/bin/python /home/jhuang/miniconda3/envs/rnaseq/bin/umi_tools dedup -I star_salmon/HSV.d8_r1.sorted.bam -S HSV.d8_r1.umi_dedup.sorted.bam --output-stats HSV.d8_r1.umi_dedup.sorted --method=unique --random-seed=100
      
      #umitools dedup uses large amounts of memory and runs slowly. To speed it up it is recommended to only run it on a single chromosome, see the FAQ point number 4.
      #I suggest either making the --output-stats optional, or running a second round of deduplication on a single chromosome to generate the output stats.
      
      #NOTE: the group name inside (?P<...>) is missing from the patterns below; umi_tools regex extraction needs a named group such as umi_1.
      #--Human--
      #hamm
      /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38 --with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P<...>.{12}).*" -profile docker -resume --max_cpus 54 --max_memory 120.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'star_salmon' --pseudo_aligner 'salmon' --umitools_grouping_method 'unique'

      #sage
      nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38 --with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P<...>.{12}).*" -profile test_full -resume --max_memory 256.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'star_salmon' --pseudo_aligner 'salmon'

      #--Virus--
      /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_virus --fasta "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1.fasta" --gtf "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1_v4.gtf" --with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P<...>.{12}).*" --umitools_dedup_stats --skip_rseqc --skip_dupradar --skip_preseq -profile test_full -resume --max_cpus 55 --max_memory 120.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'hisat2' --gtf_extra_attributes 'gene_name' --gtf_group_features 'gene_id' --featurecounts_group_type 'gene_name' --featurecounts_feature_type 'exon' --umitools_grouping_method 'unique'
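The notes in this step suggest generating the deduplication statistics on a single chromosome only, so that `--output-stats` does not exhaust memory. The following is a minimal sketch that prints such a command without running it. `dedup_stats_cmd` and the output file naming are hypothetical; `--chrom` is used here as the umi_tools dedup option that restricts processing to one chromosome, and the chromosome name must match the BAM header (for example `1` or `chr1`). The input BAM is assumed to be coordinate-sorted and indexed.

```shell
#!/bin/sh
# Sketch: print (not run) a umi_tools dedup command restricted to one chromosome,
# so that --output-stats is only computed on a small subset of the reads.
# dedup_stats_cmd and the output names are illustrative assumptions.

dedup_stats_cmd() {
    bam="$1"; chrom="$2"
    prefix="${bam%.bam}"    # e.g. HSV.d8_r1.sorted
    echo "umi_tools dedup -I ${bam} -S ${prefix}.${chrom}.umi_dedup.bam" \
         "--chrom=${chrom} --output-stats ${prefix}.${chrom}.umi_dedup" \
         "--method=unique --random-seed=100"
}

dedup_stats_cmd HSV.d8_r1.sorted.bam 1
```

Printing the command first, as done for the pandaseq loop earlier in these notes, makes it easy to inspect before piping it into a shell.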
  3. R-code for evaluation of nextflow outputs

    # Import the required libraries
    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library(gplots)
    
    library(tximport)
    library(DESeq2)
    
    setwd("~/DATA/Data_Manja_RNAseq_Organoids/results_GRCh38_unique/star_salmon")
    
    # Define paths to your Salmon output quantification files
    files <- c("control_r1" = "./control_r1/quant.sf",
              "control_r2" = "./control_r2/quant.sf",
              "HSV.d2_r1" = "./HSV.d2_r1/quant.sf",
              "HSV.d2_r2" = "./HSV.d2_r2/quant.sf",
              "HSV.d4_r1" = "./HSV.d4_r1/quant.sf",
              "HSV.d4_r2" = "./HSV.d4_r2/quant.sf",
              "HSV.d6_r1" = "./HSV.d6_r1/quant.sf",
              "HSV.d6_r2" = "./HSV.d6_r2/quant.sf",
              "HSV.d8_r1" = "./HSV.d8_r1/quant.sf",
              "HSV.d8_r2" = "./HSV.d8_r2/quant.sf")
    
    # Import the transcript abundance data with tximport
    txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
    
    # Define the replicates and condition of the samples
    replicate <- factor(c("r1", "r2", "r1", "r2", "r1", "r2", "r1", "r2", "r1", "r2"))
    condition <- factor(c("control", "control", "HSV.d2", "HSV.d2", "HSV.d4", "HSV.d4", "HSV.d6", "HSV.d6", "HSV.d8", "HSV.d8"))
    
    # Define the colData for DESeq2
    colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    
    # Optional pre-filtering: not strictly needed, because DESeq2 applies its own independent filtering of low-count genes when results() is called.
    # Keep only genes with a count > 3 in more than 2 samples
    # dds <- dds[rowSums(counts(dds) > 3) > 2, ]
    
    # Run DESeq2
    dds <- DESeq(dds)
    
    # Perform rlog transformation
    rld <- rlogTransformation(dds)
    
    # Output raw count data to a CSV file
    write.csv(counts(dds), file="transcript_counts.csv")
    
    # -- gene-level count data --
    # Read in the tx2gene map from salmon_tx2gene.tsv
    #tx2gene <- read.csv("salmon_tx2gene.tsv", sep="\t", header=FALSE)
    tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
    
    # Set the column names
    colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
    
    # Remove the gene_name column if not needed
    tx2gene <- tx2gene[,1:2]
    
    # Import and summarize the Salmon data with tximport
    txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
    
    # Continue with the DESeq2 workflow as before...
    colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #60605-->26543
    dds <- DESeq(dds)
    rld <- rlogTransformation(dds)
    write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    
    #TODO: why were so many reads removed as "too short"?
    #STAR --runThreadN 4 --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.fastq /path/to/read2.fastq --outFilterMatchNmin 50 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /path/to/output
    
    dim(counts(dds))
    head(counts(dds), 10)
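tximport stops with an error if any of the listed `quant.sf` files is missing, so it can help to check the Salmon output directories from the shell before starting R. The following is a minimal sketch, assuming the `<sample>/quant.sf` layout used in the R code above; the function name `check_quant_files` is hypothetical.

```shell
#!/bin/sh
# Sketch: confirm that every sample directory contains a non-empty quant.sf
# before importing with tximport. check_quant_files is an illustrative name.

check_quant_files() {
    dir="$1"; shift
    missing=0
    for sample in "$@"; do
        if [ ! -s "${dir}/${sample}/quant.sf" ]; then    # -s: exists and is not empty
            echo "missing or empty: ${dir}/${sample}/quant.sf" >&2
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Usage, with the sample names from the R code above:
# check_quant_files . control_r1 control_r2 HSV.d2_r1 HSV.d2_r2 HSV.d4_r1 \
#     HSV.d4_r2 HSV.d6_r1 HSV.d6_r2 HSV.d8_r1 HSV.d8_r2
```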

X-ray holographic microscopy

X-ray holographic microscopy is a technique used to produce high-resolution, three-dimensional images of microscopic objects. The technique is based on the principles of holography, where the phase and amplitude of a wave are recorded to produce an image. In X-ray holography, this wave is an X-ray beam.

Traditional optical microscopy uses visible light to image an object, and the resolution of the image is limited by the wavelength of the light. X-rays have much shorter wavelengths than visible light, so X-ray microscopy can theoretically produce images with much higher resolution.

In X-ray holography, a coherent X-ray beam is split into two paths: one path interacts with the object being imaged, and the other path is used as a reference. The object wave and the reference wave are then combined to form a hologram. This hologram can be reconstructed to produce a 3D image of the object.

One major advantage of X-ray holographic microscopy is that it can be used to image thick samples and materials that are not transparent to visible light. However, the technique requires sophisticated equipment and careful sample preparation, and it can be difficult to interpret the resulting images.

Human organoids

Organoids are three-dimensional (3D) cell cultures that mimic the structure and function of real organs or tissues. They are derived from stem cells, which have the ability to self-renew and differentiate into various cell types. Organoids can be grown in the lab using specialized techniques and growth conditions that encourage the stem cells to develop into organ-specific cells and form complex, miniature structures resembling the target organ.

Organoids have become an essential tool in research because they provide a more accurate representation of human organs compared to traditional two-dimensional (2D) cell cultures. They have numerous applications in various fields, including:

  1. Developmental biology: Organoids can help researchers study the processes involved in organ development and tissue organization.

  2. Disease modeling: Organoids can be generated from patient-derived stem cells, allowing researchers to create disease-specific models to study the underlying mechanisms of various diseases and conditions.

  3. Drug development and testing: Organoids provide a more physiologically relevant model for testing new drugs and therapies, potentially reducing the reliance on animal models and increasing the efficiency of drug development.

  4. Regenerative medicine: Organoids can be used to develop new strategies for tissue repair and regeneration, possibly leading to new treatments for various diseases and injuries.

Despite their advantages, organoids also have limitations, such as the lack of blood vessels, immune cells, and other components present in real organs. However, ongoing research aims to refine organoid technology and overcome these limitations, further expanding their potential applications in biomedical research.

RNAseq processing for organoids

  1. install conda environment

    #conda config --set auto_activate_base false
    
    conda create --name rnaseq python=3.7
    
    #NOTE: mamba is indeed much faster; use mamba from now on!
    #install packages
    conda activate rnaseq
    pip3 install deeptools
    pip3 install multiqc
    conda install -c bioconda stringtie subread gffread
    conda install -c conda-forge -c bioconda -c defaults -c r r-data.table r-gplots
    conda install -c conda-forge -c bioconda -c defaults -c r bioconductor-dupradar bioconductor-edger
    conda install nextflow=23.04
    
    conda install fq
    conda install -c bioconda umi_tools
    conda install -c bioconda rsem
    conda install -c bioconda salmon
    
    #conda install some tools
    #install R-packages, 
    conda install -c bioconda ucsc-bedclip
    conda install -c bioconda ucsc-bedgraphtobigwig
    conda install -c bioconda bioconductor-matrixgenerics
    #conda install -c bioconda bioconductor-deseq2
    conda install -c bioconda r-pheatmap
    conda install -c anaconda gawk
    
    conda install mamba -n base -c conda-forge
    conda config --add channels conda-forge
    mamba install -c bioconda salmon=1.10
    #salmon should be >= 1.10, since those versions enable `--validateMappings` by default.
    
    conda install -c bioconda trim-galore star=2.6.1d bioconductor-summarizedexperiment bioconductor-tximport bioconductor-tximeta bioconductor-deseq2
    mamba install -c bioconda samtools=1.9  
    mamba install -c conda-forge r-optparse r-vctrs=0.5.0
    conda install nextflow=23.04
    mamba install -c bioconda qualimap
    mamba install -c bioconda rseqc
    mamba install -c conda-forge openssl
    conda install -c bioconda ucsc-bedclip
    conda install -c bioconda bedtools
    conda update -c bioconda ucsc-bedclip
    #for DEBUG: bedClip: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory
    conda update -c bioconda ucsc-bedgraphtobigwig
    # samtools should be >= 1.9, as only those versions have the -@ option
    #samtools sort \
    #      -@ 6 \
    #      -o HSV.d2_r1.sorted.bam \
    #      -T HSV.d2_r1.sorted \
    #      HSV.d2_r1.Aligned.out.bam
  2. run UMItools without --umitools_dedup_stats, otherwise the run cannot finish on hamm.

    • Optimize UMItools parameters: Some parameters might influence the memory usage of UMItools. For example, you can try to reduce the number of allowed mismatches in the UMI sequence (--edit-distance-threshold). This will make the deduplication process less memory intensive but might also impact the results.

    • Use other deduplication tools: If the problem persists, you might need to use alternative tools for UMI deduplication which are less memory-intensive. Tools such as fgbio have a grouping and deduplication method similar to UMItools but have been reported to require less memory.

      #https://github.com/nf-core/rnaseq/issues/827
      #INFO for DEBUG: https://umi-tools.readthedocs.io/en/latest/faq.html
      #INFO for DEBUG: https://readthedocs.org/projects/umi-tools/downloads/pdf/stable/
      #https://github.com/CGATOxford/UMI-tools/issues/173
      #excessive dedup memory usage with output-stats #409
      #https://github.com/CGATOxford/UMI-tools/issues/409
      #umi_tools 1.0.1
      #I am aware of previously closed issues:
      #excessive dedup memory usage #173
      #speed up stats #184
      #Running a single-end bam file with 3.13M reads and a 10bp (fully random) UMI.
      #Using --method=unique
      #There still seems to be a memory problem with --output-stats
      #Running with output-stats, memory usage climbs over 100GB and eventually crashes with "MemoryError".
      #Running without output-stats, job completes in about 3 minutes, with no problems.

      #TRY STANDALONE RUNNING: /usr/local/bin/python /usr/local/bin/umi_tools dedup -I HSV.d8_r1.transcriptome.sorted.bam -S HSV.d8_r1.umi_dedup.transcriptome.sorted.bam --method=unique --random-seed=100
      #/home/jhuang/miniconda3/envs/rnaseq/bin/python /home/jhuang/miniconda3/envs/rnaseq/bin/umi_tools dedup -I star_salmon/HSV.d8_r1.sorted.bam -S HSV.d8_r1.umi_dedup.sorted.bam --output-stats HSV.d8_r1.umi_dedup.sorted --method=unique --random-seed=100

      #umitools dedup uses large amounts of memory and runs slowly. To speed it up it is recommended to only run it on a single chromosome, see the FAQ point number 4.
      #I suggest either making the --output-stats optional, or running a second round of deduplication on a single chromosome to generate the output stats.

      #NOTE: the group name inside (?P<...>) is missing from the patterns below; umi_tools regex extraction needs a named group such as umi_1.
      #--Human--
      #hamm
      /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38 --with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P<...>.{12}).*" -profile docker -resume --max_cpus 54 --max_memory 120.GB --max_time 2400.h --aligner 'star_salmon' --pseudo_aligner 'salmon' --umitools_grouping_method 'unique' #--save_align_intermeds --save_unaligned --save_reference

      #sage
      nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38 --with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P<...>.{12}).*" -profile test_full -resume --max_memory 256.GB --max_time 2400.h --aligner 'star_salmon' --pseudo_aligner 'salmon' #--save_align_intermeds --save_unaligned --save_reference

      #--Virus--
      /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_virus --fasta "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1.fasta" --gtf "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1_v4.gtf" --with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P<...>.{12}).*" --umitools_dedup_stats --skip_rseqc --skip_dupradar --skip_preseq -profile test_full -resume --max_cpus 55 --max_memory 120.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'hisat2' --gtf_extra_attributes 'gene_name' --gtf_group_features 'gene_id' --featurecounts_group_type 'gene_name' --featurecounts_feature_type 'exon' --umitools_grouping_method 'unique'
  3. R-code for evaluation of nextflow outputs

    # Import the required libraries
    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library(gplots)
    
    library(tximport)
    library(DESeq2)
    
    setwd("~/DATA/Data_Manja_RNAseq_Organoids/results_GRCh38_unique/star_salmon")
    
    # Define paths to your Salmon output quantification files
    files <- c("control_r1" = "./control_r1/quant.sf",
              "control_r2" = "./control_r2/quant.sf",
              "HSV.d2_r1" = "./HSV.d2_r1/quant.sf",
              "HSV.d2_r2" = "./HSV.d2_r2/quant.sf",
              "HSV.d4_r1" = "./HSV.d4_r1/quant.sf",
              "HSV.d4_r2" = "./HSV.d4_r2/quant.sf",
              "HSV.d6_r1" = "./HSV.d6_r1/quant.sf",
              "HSV.d6_r2" = "./HSV.d6_r2/quant.sf",
              "HSV.d8_r1" = "./HSV.d8_r1/quant.sf",
              "HSV.d8_r2" = "./HSV.d8_r2/quant.sf")
    
    # Import the transcript abundance data with tximport
    txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
    
    # Define the replicates and condition of the samples
    replicate <- factor(c("r1", "r2", "r1", "r2", "r1", "r2", "r1", "r2", "r1", "r2"))
    condition <- factor(c("control", "control", "HSV.d2", "HSV.d2", "HSV.d4", "HSV.d4", "HSV.d6", "HSV.d6", "HSV.d8", "HSV.d8"))
    
    # Define the colData for DESeq2
    colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    
    # Optional pre-filtering. It is not strictly required with tximport and DESeq2, because results() applies independent filtering of low-count rows by default.
    # To keep only rows with a count > 3 in more than 2 samples, uncomment:
    # dds <- dds[rowSums(counts(dds) > 3) > 2, ]
    
    # Run DESeq2
    dds <- DESeq(dds)
    
    # Perform rlog transformation
    rld <- rlogTransformation(dds)
    
    # Output raw count data to a CSV file
    write.csv(counts(dds), file="transcript_counts.csv")
    
    # -- gene-level count data --
    # Read in the tx2gene map from salmon_tx2gene.tsv
    #tx2gene <- read.csv("salmon_tx2gene.tsv", sep="\t", header=FALSE)
    tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
    
    # Set the column names
    colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
    
    # Remove the gene_name column if not needed
    tx2gene <- tx2gene[,1:2]
    
    # Import and summarize the Salmon data with tximport
    txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
    
    # Continue with the DESeq2 workflow as before...
    colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #60605-->26543
    dds <- DESeq(dds)
    rld <- rlogTransformation(dds)
    write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    
    #TODO: why were so many reads removed as "too short"?
    #STAR --runThreadN 4 --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.fastq /path/to/read2.fastq --outFilterMatchNmin 50 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /path/to/output
    
    dim(counts(dds))
    head(counts(dds), 10)  
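The gene-level import above depends on the transcript-to-gene map in salmon_tx2gene.tsv. As a rough illustration of what "summarizing to gene level" means, the Python sketch below sums per-transcript counts for each gene. It is a simplification and not what tximport does internally: tximport also tracks average transcript lengths, which are ignored here. The IDs and counts are made up.

```python
from collections import defaultdict

def summarize_to_gene(tx_counts, tx2gene):
    """Sum transcript-level counts per gene.

    Transcripts that are missing from the tx2gene map are skipped.
    """
    gene_counts = defaultdict(float)
    for tx_id, count in tx_counts.items():
        gene_id = tx2gene.get(tx_id)
        if gene_id is None:
            continue
        gene_counts[gene_id] += count
    return dict(gene_counts)

# Hypothetical IDs and counts, for illustration only
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
tx_counts = {"tx1": 10.0, "tx2": 5.5, "tx3": 7.0, "tx4": 1.0}

gene_counts = summarize_to_gene(tx_counts, tx2gene)
print(gene_counts)
```

Here tx4 has no entry in the map, so it does not contribute to any gene.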
  4. draw 3D PCA plots.

    library(gplots) 
    library("RColorBrewer")
    
    library(ggplot2)
    data <- plotPCA(rld, intgroup=c("condition", "replicate"), returnData=TRUE)
    write.csv(data, file="plotPCA_data.csv")
    #calculate all PCs including PC3 with the following codes
    library(genefilter)
    ntop <- 500
    rv <- rowVars(assay(rld))
    select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
    mat <- t( assay(rld)[select, ] )
    pc <- prcomp(mat)
    pc$x[,1:3]
    #df_pc <- data.frame(pc$x[,1:3])
    df_pc <- data.frame(pc$x)
    identical(rownames(data), rownames(df_pc)) #-->TRUE
    
    data$PC1 <- NULL
    data$PC2 <- NULL
    merged_df <- merge(data, df_pc, by = "row.names")
    #merged_df <- merged_df[, -1]
    row.names(merged_df) <- merged_df$Row.names
    merged_df$Row.names <- NULL  # remove the "name" column
    merged_df$name <- NULL
    merged_df <- merged_df[, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9","PC10","group","condition","replicate")]
    write.csv(merged_df, file="merged_df_10PCs.csv")
    summary(pc)
    #proportion of variance, PC1-PC3: 0.5333 0.2125 0.06852
    
    # Draw the 3D PCA plot with the external script draw_3D.py (not shown here)
    
    # -- before pca --
    png("pca_before_removeBatch2.png", 1200, 800)
    plotPCA(rld, intgroup=c("condition"))
    dev.off()
    
    # -- before heatmap --
    png("heatmap_before_removeBatch2.png", 1200, 800)
    distsRL <- dist(t(assay(rld)))
    mat <- as.matrix(distsRL)
    hc <- hclust(distsRL)
    hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    dev.off()
    
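    # -- remove the batch effect --
    # Assumption: the replicate is treated as the batch to remove, and the condition effect is kept in the design.
    mat <- assay(rld)
    mm <- model.matrix(~condition, colData(rld))
    mat <- limma::removeBatchEffect(mat, batch=rld$replicate, design=mm)
    assay(rld) <- mat
    
    # -- after pca --
    png("pca_after_removeBatch.png", 1200, 800)
    #svg("pca_after_removeBatch.svg")
    plotPCA(rld, intgroup=c("replicate"))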
    dev.off()
    
    # -- after heatmap --
    png("heatmap_after_removeBatch.png", 1200, 800)
    #svg("heatmap_after_removeBatch.svg")
    distsRL <- dist(t(assay(rld)))
    mat <- as.matrix(distsRL)
    hc <- hclust(distsRL)
    hmcol <- colorRampPalette(brewer.pal(9,"GnBu"))(100)
    heatmap.2(mat, Rowv=as.dendrogram(hc),symm=TRUE, trace="none",col = rev(hmcol), margin=c(13, 13))
    dev.off()
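The PCA step above first keeps only the ntop rows with the highest variance across samples (rowVars, then order with decreasing = TRUE). The same selection can be sketched in plain Python as follows. The matrix layout (a dict of row name to values) and the toy numbers are assumptions made for illustration.

```python
from statistics import variance

def top_variable_rows(matrix, ntop=500):
    """Return the names of the ntop rows with the highest sample variance.

    matrix maps a row name (gene) to its list of values, one per sample.
    """
    ranked = sorted(matrix, key=lambda name: variance(matrix[name]), reverse=True)
    return ranked[:min(ntop, len(ranked))]

# Made-up expression values, for illustration only
expr = {
    "g1": [1.0, 1.0, 1.0, 1.0],   # constant, variance 0
    "g2": [0.0, 4.0, 0.0, 4.0],   # most variable
    "g3": [1.0, 2.0, 1.0, 2.0],
}
selected = top_variable_rows(expr, ntop=2)
print(selected)
```

Constant rows such as g1 carry no information for the PCA and fall to the end of the ranking.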
  5. (optional) estimate size factors

    > head(dds)
    class: DESeqDataSet 
    dim: 6 10 
    metadata(1): version
    assays(6): counts avgTxLength ... H cooks
    rownames(6): ENSG00000000003 ENSG00000000005 ... ENSG00000000460
      ENSG00000000938
    rowData names(34): baseMean baseVar ... deviance maxCooks
    colnames(10): control_r1 control_r2 ... HSV.d8_r1 HSV.d8_r2
    colData names(2): condition replicate
    
    #Convert BAM to bigWig with deepTools, using the inverse of the DESeq2 size factor as --scaleFactor
    sizeFactors(dds)
    #NULL
    dds <- estimateSizeFactors(dds)
    sizeFactors(dds)
    
    normalized_counts <- counts(dds, normalized=TRUE)
    #write.table(normalized_counts, file="normalized_counts.txt", sep="\t", quote=F, col.names=NA)
    
    # ---- DEBUG sizeFactors(dds) always NULL, see https://support.bioconductor.org/p/97676/ ----
    nm <- assays(dds)[["avgTxLength"]]
    sf <- estimateSizeFactorsForMatrix(counts(dds), normMatrix=nm)
    
    assays(dds)$counts  # for count data
    assays(dds)$avgTxLength  # for average transcript length, etc.
    assays(dds)$normalizationFactors
    
    # With tximport input, DESeq2 normalizes with the gene-by-sample matrix assays(dds)$normalizationFactors (derived from avgTxLength) instead of per-sample size factors, which is why sizeFactors(dds) returns NULL.
    > assays(dds)
    List of length 6
    names(6): counts avgTxLength normalizationFactors mu H cooks
    
    # Per-sample size factors can still be computed manually with the median-of-ratios method. A very simplified version:
    geoMeans <- apply(assays(dds)$counts, 1, function(row) if (all(row == 0)) 0 else exp(mean(log(row[row != 0]))))
    sizeFactors(dds) <- apply(assays(dds)$counts, 2, function(col) median((col / geoMeans)[geoMeans > 0 & col > 0]))
    
    # ---- DEBUG END ----
    
    #in the console
    #  control_r1  ...
    # 1/0.9978755  ... 
    
    > sizeFactors(dds)
                        HeLa_TO_r1                      HeLa_TO_r2 
                          0.9978755                       1.1092227 
    
    1/0.9978755=1.002129023
    1/1.1092227=0.901532217
    
    #bamCoverage --bam ../markDuplicates/${sample}Aligned.sortedByCoord.out.bam -o ${sample}_norm.bw --binSize 10 --scaleFactor  --effectiveGenomeSize 2864785220
    bamCoverage --bam ../markDuplicates/HeLa_TO_r1Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r1.bw --binSize 10 --scaleFactor 1.002129023     --effectiveGenomeSize 2864785220
    bamCoverage --bam ../markDuplicates/HeLa_TO_r2Aligned.sortedByCoord.out.markDups.bam -o HeLa_TO_r2.bw --binSize 10 --scaleFactor  0.901532217        --effectiveGenomeSize 2864785220
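The two calculations in this step, median-of-ratios size factors and their reciprocals as bamCoverage scale factors, can be sketched in plain Python. This is a minimal illustration, not DESeq2's implementation: it only uses genes with nonzero counts in every sample, the toy count matrix is made up, and the BAM file names in the printed commands are placeholders.

```python
import math
from statistics import median

def median_ratio_size_factors(counts):
    """counts maps sample -> list of gene counts (same gene order in all samples).

    Only genes with nonzero counts in every sample are used.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_geo_means = {}
    for g in range(n_genes):
        row = [counts[s][g] for s in samples]
        if all(c > 0 for c in row):
            log_geo_means[g] = sum(math.log(c) for c in row) / len(row)
    factors = {}
    for s in samples:
        log_ratios = [math.log(counts[s][g]) - lgm for g, lgm in log_geo_means.items()]
        factors[s] = math.exp(median(log_ratios))
    return factors

# Made-up counts: s2 is sequenced exactly twice as deeply as s1
toy = {"s1": [10, 20, 30, 0], "s2": [20, 40, 60, 5]}
toy_factors = median_ratio_size_factors(toy)
print(toy_factors)

# Size factors as printed in the text; the scale factor is the reciprocal
doc_size_factors = {"HeLa_TO_r1": 0.9978755, "HeLa_TO_r2": 1.1092227}
scale_factors = {s: 1.0 / f for s, f in doc_size_factors.items()}
for sample, scale in scale_factors.items():
    # "<sample>.bam" is a placeholder path, not the real file name
    print(f"bamCoverage --bam {sample}.bam -o {sample}.bw --binSize 10 "
          f"--scaleFactor {scale:.9f} --effectiveGenomeSize 2864785220")
```

Dividing coverage by the size factor (multiplying by its reciprocal) puts the bigWig tracks of different samples on a comparable scale.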
  6. differential expression

    #A central method for exploring differences between groups of samples is differential gene expression analysis.
    #Run ONE of the following relevel blocks at a time, then the annotation loop further below with the matching clist.
    
    dds$condition <- relevel(dds$condition, "control")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("HSV.d2_vs_control","HSV.d4_vs_control","HSV.d6_vs_control","HSV.d8_vs_control")
    
    dds$condition <- relevel(dds$condition, "HSV.d2")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("HSV.d4_vs_HSV.d2","HSV.d6_vs_HSV.d2","HSV.d8_vs_HSV.d2")
    
    dds$condition <- relevel(dds$condition, "HSV.d4")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("HSV.d6_vs_HSV.d4","HSV.d8_vs_HSV.d4")
    
    dds$condition <- relevel(dds$condition, "HSV.d6")
    dds = DESeq(dds, betaPrior=FALSE)
    resultsNames(dds)
    clist <- c("HSV.d8_vs_HSV.d6")
    
    ##https://bioconductor.statistik.tu-dortmund.de/packages/3.7/data/annotation/
    #BiocManager::install("EnsDb.Mmusculus.v79")
    #library(EnsDb.Mmusculus.v79)
    #edb <- EnsDb.Mmusculus.v79
    
    #https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html#selecting-an-ensembl-biomart-database-and-dataset
    library(biomaRt)
    listEnsembl()
    listMarts()
    #ensembl <- useEnsembl(biomart = "genes", mirror="asia")  # default is Mouse strains 104
    #ensembl <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl", mirror = "www")
    #ensembl = useMart("ensembl_mart_44", dataset="hsapiens_gene_ensembl",archive=TRUE, mysql=TRUE)
    #ensembl <- useEnsembl(biomart = "ensembl", dataset = "mmusculus_gene_ensembl", version="104")
    #ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version="86")
    #--> total 69, 27  GRCh38.p7 and 39  GRCm38.p4; we should take 104, since rnaseq-pipeline is also using annotation of 104!
    ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version="104")
    datasets <- listDatasets(ensembl)
    #--> total 202   80                         GRCh38.p13         107                            GRCm39
    #80           hsapiens_gene_ensembl                                      Human genes (GRCh38.p13)                         GRCh38.p13
    #107         mmusculus_gene_ensembl                                        Mouse genes (GRCm39)                            GRCm39
    
    > listEnsemblArchives()
                name     date                                 url version
    1  Ensembl GRCh37 Feb 2014          https://grch37.ensembl.org  GRCh37
    2     Ensembl 109 Feb 2023 https://feb2023.archive.ensembl.org     109
    3     Ensembl 108 Oct 2022 https://oct2022.archive.ensembl.org     108
    4     Ensembl 107 Jul 2022 https://jul2022.archive.ensembl.org     107
    5     Ensembl 106 Apr 2022 https://apr2022.archive.ensembl.org     106
    6     Ensembl 105 Dec 2021 https://dec2021.archive.ensembl.org     105
    7     Ensembl 104 May 2021 https://may2021.archive.ensembl.org     104
    
    attributes = listAttributes(ensembl)
    attributes[1:25,]
    
    #https://www.ncbi.nlm.nih.gov/grc/human
    #BiocManager::install("org.Mmu.eg.db")
    #library("org.Mmu.eg.db")
    #edb <- org.Mmu.eg.db
    #
    #https://bioconductor.statistik.tu-dortmund.de/packages/3.6/data/annotation/
    #EnsDb.Mmusculus.v79
    #> query(hub, c("EnsDb", "apiens", "98"))
    #columns(edb)
    
    #searchAttributes(mart = ensembl, pattern = "symbol")
    
    ##https://www.geeksforgeeks.org/remove-duplicate-rows-based-on-multiple-columns-using-dplyr-in-r/
    library(dplyr)
    library(tidyverse)
    #df <- data.frame (lang =c ('Java','C','Python','GO','RUST','Javascript',
    #                      'Cpp','Java','Julia','Typescript','Python','GO'),
    #                      value = c (21,21,3,5,180,9,12,20,6,0,3,6),
    #                      usage =c(21,21,0,99,44,48,53,16,6,8,0,6))
    #distinct(df, lang, .keep_all= TRUE)
    
    for (i in clist) {
    #"HSV.d2_vs_control","HSV.d4_vs_control","HSV.d6_vs_control","HSV.d8_vs_control"
    #i<-clist[1]
      contrast = paste("condition", i, sep="_")
      res = results(dds, name=contrast)
      res <- res[!is.na(res$log2FoldChange),]
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("ENTREZID","EXONID","GENEBIOTYPE","GENEID","GENENAME","PROTEINDOMAINSOURCE","PROTEINID","SEQNAME","SEQSTRAND","SYMBOL","TXBIOTYPE","TXID","TXNAME","UNIPROTID"))
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "ENTREZID", "SYMBOL", "GENENAME","GENEBIOTYPE","TXBIOTYPE","SEQSTRAND","UNIPROTID"))
      # In the ENSEMBL-database, GENEID is ENSEMBL-ID.
      #geness <- AnnotationDbi::select(edb86, keys = rownames(res), keytype = "GENEID", columns = c("GENEID", "SYMBOL", "GENEBIOTYPE"))  #  "ENTREZID", "TXID","TXBIOTYPE","TXSEQSTART","TXSEQEND"
      #geness <- geness[!duplicated(geness$GENEID), ]
    
      #using getBM replacing AnnotationDbi::select
      #filters = 'ensembl_gene_id' means the records should always have a valid ensembl_gene_ids.
      geness <- getBM(attributes = c('ensembl_gene_id', 'external_gene_name', 'gene_biotype', 'entrezgene_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'description'),
          filters = 'ensembl_gene_id',
          values = rownames(res), 
          mart = ensembl)
      geness_uniq <- distinct(geness, ensembl_gene_id, .keep_all= TRUE)
    
      #merge on the shared Ensembl gene ID (ensembl_gene_id in geness_uniq, ENSEMBL in res_df)
      res$ENSEMBL = rownames(res)
      identical(rownames(res), geness_uniq$ensembl_gene_id)
      res_df <- as.data.frame(res)
      geness_res <- merge(geness_uniq, res_df, by.x="ensembl_gene_id", by.y="ENSEMBL")
      dim(geness_res)
      rownames(geness_res) <- geness_res$ensembl_gene_id
      geness_res$ensembl_gene_id <- NULL
      write.csv(as.data.frame(geness_res[order(geness_res$pvalue),]), file = paste(i, "all.txt", sep="-"))
      up <- subset(geness_res, padj<=0.05 & log2FoldChange>=2)
      down <- subset(geness_res, padj<=0.05 & log2FoldChange<=-2)
      write.csv(as.data.frame(up[order(up$log2FoldChange,decreasing=TRUE),]), file = paste(i, "up.txt", sep="-"))
      write.csv(as.data.frame(down[order(abs(down$log2FoldChange),decreasing=TRUE),]), file = paste(i, "down.txt", sep="-"))
    }
    
    #-- show methods of class DESeq2 --
    #x=capture.output(showMethods(class="DESeq2"))
    #unlist(lapply(strsplit(x[grep("Function: ",x,)]," "),function(x) x[2]))
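The up/down split at the end of the loop above (padj <= 0.05 and log2FoldChange >= 2 or <= -2, then sorted by effect size) can be sketched in Python as follows. The column names match the R output; the gene names and values are made up, and the `gene` column stands in for the row names of the real files.

```python
import csv
import io

def to_float(text):
    """Convert a CSV field to float; R writes missing values as NA."""
    return None if text in ("", "NA") else float(text)

def split_up_down(rows, padj_cutoff=0.05, lfc_cutoff=2.0):
    """Split result rows into up- and down-regulated lists, sorted by effect size."""
    up, down = [], []
    for row in rows:
        padj = to_float(row["padj"])
        lfc = to_float(row["log2FoldChange"])
        if padj is None or lfc is None or padj > padj_cutoff:
            continue
        if lfc >= lfc_cutoff:
            up.append(row)
        elif lfc <= -lfc_cutoff:
            down.append(row)
    up.sort(key=lambda r: float(r["log2FoldChange"]), reverse=True)
    down.sort(key=lambda r: abs(float(r["log2FoldChange"])), reverse=True)
    return up, down

# Made-up rows, for illustration only
example = io.StringIO(
    "gene,log2FoldChange,padj\n"
    "geneA,3.1,0.001\n"
    "geneB,-2.5,0.04\n"
    "geneC,2.4,NA\n"
    "geneD,1.0,0.0001\n"
    "geneE,-4.0,0.01\n"
)
up, down = split_up_down(csv.DictReader(example))
up_genes = [r["gene"] for r in up]
down_genes = [r["gene"] for r in down]
print(up_genes, down_genes)
```

Rows with a missing padj (geneC) are dropped, which mirrors how subset() treats NA in R.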
  7. volcano plots with automatic selection of the labeled genes (top_g)

    #A canonical visualization for interpreting differential gene expression results is the volcano plot.
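    library(ggplot2)
    library(ggrepel)
    
    #The bash loop below only prints R code for each contrast; paste its output into the R session.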
    for i in HSV.d2_vs_control HSV.d4_vs_control HSV.d6_vs_control HSV.d8_vs_control HSV.d4_vs_HSV.d2 HSV.d6_vs_HSV.d2 HSV.d8_vs_HSV.d2 HSV.d6_vs_HSV.d4 HSV.d8_vs_HSV.d4 HSV.d8_vs_HSV.d6; do
    #HSV.d4_vs_control HSV.d6_vs_control HSV.d8_vs_control
    #for i in K3R_24hdox_vs_K3R_3hdox21hchase WT_3hdox21hchase_vs_K3R_3hdox21hchase; do
    #for i in WT_24hdox_vs_K3R_24hdox; do
    #for i in WT_24hdox_vs_WT_3hdox21hchase; do
      # read files to geness_res
      echo "geness_res <- read.csv(file = paste(\"${i}\", \"all.txt\", sep=\"-\"), row.names=1)"
    
      echo "geness_res\$Color <- \"NS or log2FC < 2.0\""
      echo "geness_res\$Color[geness_res\$pvalue < 0.05] <- \"P < 0.05\""
      echo "geness_res\$Color[geness_res\$padj < 0.05] <- \"P-adj < 0.05\""
      echo "geness_res\$Color[abs(geness_res\$log2FoldChange) < 2.0] <- \"NS or log2FC < 2.0\""
      echo "geness_res\$Color <- factor(geness_res\$Color, levels = c(\"NS or log2FC < 2.0\", \"P < 0.05\", \"P-adj < 0.05\"))"
      echo "write.csv(geness_res, \"${i}_with_Category.csv\")"
    
      # pick top genes for either side of volcano to label
      # order genes for convenience:
      echo "geness_res\$invert_P <- (-log10(geness_res\$pvalue)) * sign(geness_res\$log2FoldChange)"
      echo "top_g <- c()"
      echo "top_g <- c(top_g, \
                geness_res[, 'external_gene_name'][order(geness_res[, 'invert_P'], decreasing = TRUE)[1:100]], \
                geness_res[, 'external_gene_name'][order(geness_res[, 'invert_P'], decreasing = FALSE)[1:100]])"
      echo "top_g <- unique(top_g)"
      # print the genes that will be labeled (top_g has to be defined first)
      echo "subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res\$log2FoldChange) >= 2.0))"
      echo "geness_res <- geness_res[, -1*ncol(geness_res)]"  # remove invert_P from matrix
    
      # Graph results
      echo "png(\"${i}.png\",width=1200, height=2000)"
      echo "ggplot(geness_res, \
          aes(x = log2FoldChange, y = -log10(pvalue), \
              color = Color, label = external_gene_name)) + \
          geom_vline(xintercept = c(2.0, -2.0), lty = \"dashed\") + \
          geom_hline(yintercept = -log10(0.05), lty = \"dashed\") + \
          geom_point() + \
          labs(x = \"log2(FC)\", y = \"Significance, -log10(P)\", color = \"Significance\") + \
          scale_color_manual(values = c(\"P-adj < 0.05\"=\"darkblue\",\"P < 0.05\"=\"lightblue\",\"NS or log2FC < 2.0\"=\"darkgray\"),guide = guide_legend(override.aes = list(size = 4))) + scale_y_continuous(expand = expansion(mult = c(0,0.05))) + \
          geom_text_repel(data = subset(geness_res, external_gene_name %in% top_g & pvalue < 0.05 & (abs(geness_res\$log2FoldChange) >= 2.0)), size = 4, point.padding = 0.15, color = \"black\", min.segment.length = .1, box.padding = .2, lwd = 2) + \
          theme_bw(base_size = 16) + \
          theme(legend.position = \"bottom\")"
      echo "dev.off()"
    done
    
    sed -i -e 's/Color/Category/g' *_Category.csv
    
    for i in HSV.d2_vs_control HSV.d4_vs_control HSV.d6_vs_control HSV.d8_vs_control HSV.d4_vs_HSV.d2 HSV.d6_vs_HSV.d2 HSV.d8_vs_HSV.d2 HSV.d6_vs_HSV.d4 HSV.d8_vs_HSV.d4 HSV.d8_vs_HSV.d6; do
      echo "~/Tools/csv2xls-0.4/csv_to_xls.py ${i}-all.txt ${i}-up.txt ${i}-down.txt -d$',' -o ${i}.xls;"
    done
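The generated R code assigns the volcano categories by overwriting: everything starts as "NS or log2FC < 2.0", is upgraded by p-value and then by adjusted p-value, and is finally reset if the absolute fold change is below the cutoff. A Python sketch of that order of operations, together with the signed score used to rank genes for labeling (invert_P), is shown below. The function names and the example values are chosen for illustration.

```python
import math

def volcano_category(pvalue, padj, log2fc, alpha=0.05, lfc_cutoff=2.0):
    """Reproduce the overwrite order of the Color column; None stands for NA."""
    category = "NS or log2FC < 2.0"
    if pvalue is not None and pvalue < alpha:
        category = "P < 0.05"
    if padj is not None and padj < alpha:
        category = "P-adj < 0.05"
    if abs(log2fc) < lfc_cutoff:
        category = "NS or log2FC < 2.0"
    return category

def signed_score(pvalue, log2fc):
    """-log10(p) carrying the sign of the fold change (0 for a fold change of 0)."""
    sign = (log2fc > 0) - (log2fc < 0)
    return -math.log10(pvalue) * sign

# Made-up values, for illustration only
print(volcano_category(0.001, 0.01, 3.0))
print(volcano_category(0.01, 0.2, -2.5))
print(volcano_category(0.001, 0.01, 1.0))
print(signed_score(0.01, -3.0))
```

Because the fold-change check comes last, a gene with a very small adjusted p-value but a small effect still ends up in the gray category.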
  8. cluster the genes and draw a heatmap

    #install.packages("gplots")  # only needed once
    library("gplots")
    
    for i in HSV.d2_vs_control HSV.d4_vs_control HSV.d6_vs_control HSV.d8_vs_control HSV.d4_vs_HSV.d2 HSV.d6_vs_HSV.d2 HSV.d8_vs_HSV.d2 HSV.d6_vs_HSV.d4 HSV.d8_vs_HSV.d4 HSV.d8_vs_HSV.d6; do
      echo "cut -d',' -f1-1 ${i}-up.txt > ${i}-up.id"
      echo "cut -d',' -f1-1 ${i}-down.txt > ${i}-down.id"
    done
    
        5 HSV.d2_vs_control-down.id
       14 HSV.d2_vs_control-up.id
       77 HSV.d4_vs_control-down.id
      460 HSV.d4_vs_control-up.id
      977 HSV.d6_vs_control-down.id
     1863 HSV.d6_vs_control-up.id
     1361 HSV.d8_vs_control-down.id
     1215 HSV.d8_vs_control-up.id
       35 HSV.d4_vs_HSV.d2-down.id
      205 HSV.d4_vs_HSV.d2-up.id
      832 HSV.d6_vs_HSV.d2-down.id
     1550 HSV.d6_vs_HSV.d2-up.id
      386 HSV.d6_vs_HSV.d4-down.id
      103 HSV.d6_vs_HSV.d4-up.id
     1136 HSV.d8_vs_HSV.d2-down.id
     1050 HSV.d8_vs_HSV.d2-up.id
      598 HSV.d8_vs_HSV.d4-down.id
      292 HSV.d8_vs_HSV.d4-up.id
      305 HSV.d8_vs_HSV.d6-down.id
      133 HSV.d8_vs_HSV.d6-up.id
    12597 total
    
    cat *.id | sort -u > ids
    #add Gene_Id in the first line, delete the ""
    GOI <- read.csv("ids")$Gene_Id
    RNASeq.NoCellLine <- assay(rld)
    
    #clustering methods: "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).  pearson or spearman
    datamat = RNASeq.NoCellLine[GOI, ]
    write.csv(as.data.frame(datamat), file ="significant_gene_expressions.txt")
    hr <- hclust(as.dist(1-cor(t(datamat), method="pearson")), method="complete")
    hc <- hclust(as.dist(1-cor(datamat, method="spearman")), method="complete")
    mycl = cutree(hr, h=max(hr$height)/1.05)
    mycol = c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
    
    mycol = mycol[as.vector(mycl)]
    sampleCols <- rep('GREY',ncol(datamat))
    names(sampleCols) <- c("control r1","control r2","day2 r1","day2 r2","day4 r1","day4 r2", "day6 r1","day6 r2", "day8 r1","day8 r2")
    #sampleCols[substr(colnames(RNASeq.NoCellLine_),1,4)=='mock'] <- 'GREY'
    
    sampleCols["control r1"] <- 'DARKBLUE'
    sampleCols["control r2"] <- 'DARKBLUE'
    sampleCols["day2 r1"] <- 'DARKRED'
    sampleCols["day2 r2"] <- 'DARKRED'
    sampleCols["day4 r1"] <- 'DARKORANGE'
    sampleCols["day4 r2"] <- 'DARKORANGE'
    sampleCols["day6 r1"] <- 'DARKGREEN'
    sampleCols["day6 r2"] <- 'DARKGREEN'
    sampleCols["day8 r1"] <- 'DARKCYAN'
    sampleCols["day8 r2"] <- 'DARKCYAN'
    
    png("DEGs_heatmap.png", width=1000, height=1200)
    heatmap.2(as.matrix(datamat),Rowv=as.dendrogram(hr),Colv = NA, dendrogram = 'row',
                scale='row',trace='none',col=bluered(75), 
                RowSideColors = mycol, ColSideColors = sampleCols, labRow="", margins=c(22,10), cexRow=8, cexCol=2, srtCol=20, lwid=c(1,7), lhei = c(1, 8))
    #legend("top", title = "",legend=c("WaGa_RNA","MKL1_RNA","WaGa_EV_RNA","MKL1_EV_RNA"), fill=c("DARKBLUE","DARKRED","DARKORANGE","DARKGREEN"), cex=0.8, box.lty=0)
    dev.off()
    
    #c("YELLOW", "DARKBLUE", "DARKORANGE", "DARKMAGENTA", "DARKCYAN", "DARKRED",  "MAROON", "DARKGREEN", "LIGHTBLUE", "PINK", "MAGENTA", "LIGHTCYAN","LIGHTGREEN", "BLUE", "ORANGE", "CYAN", "RED", "GREEN");
    write.csv(names(subset(mycl, mycl == '1')),file='cluster1_YELLOW.txt')
    write.csv(names(subset(mycl, mycl == '2')),file='cluster2_DARKBLUE.txt') 
    write.csv(names(subset(mycl, mycl == '3')),file='cluster3_DARKORANGE.txt')  
    write.csv(names(subset(mycl, mycl == '4')),file='cluster4_DARKMAGENTA.txt') 
    write.csv(names(subset(mycl, mycl == '5')),file='cluster5_DARKCYAN.txt')  
    #~/Tools/csv2xls-0.4/csv_to_xls.py cluster*.txt -d',' -o DEGs_heatmap_cluster_members.xls
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py \
    significant_gene_expressions.txt \
    -d',' -o DEGs_heatmap_expression_data.xls;
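The gene clustering above uses 1 minus the Pearson correlation as the distance between two expression profiles, so genes that rise and fall together are close regardless of their absolute level. A small Python sketch of that distance, with made-up profiles:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_distance(x, y):
    """1 - r: 0 for perfectly correlated profiles, 2 for perfectly anti-correlated ones."""
    return 1.0 - pearson(x, y)

# Made-up expression profiles across four samples
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_a, twice the level
gene_c = [4.0, 3.0, 2.0, 1.0]   # opposite trend

print(correlation_distance(gene_a, gene_b))
print(correlation_distance(gene_a, gene_c))
```

This is why gene_a and gene_b would fall into the same cluster even though their expression levels differ by a factor of two.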

Prepare virus GTF for nextflow run

  1. Prepare GTF for non-model virus

    • gffread -T converts GFF to GTF, but it does not necessarily preserve all attribute information. The -T option enforces creation of the gene_id and transcript_id attributes, which are mandatory in GTF, and gffread derives them from the ID and Parent fields of the input GFF file.

    • GTF is simpler than GFF and does not accommodate every attribute a GFF file can carry, which is why the converted file contains less information.

    • Retaining all information from the GFF file requires post-processing to add the extra attributes back into the GTF file. Downstream tools that expect GTF may not handle extra attributes correctly.

      # -- Deprecated processing for virus gtf --
      #NOT_USED, since it changed a lot!
      #gffread X14112.1.gff -T -o X14112.1.gtf
      cp X14112.1.gff3 X14112.1.gff3_backup
      grep "^##" X14112.1.gff3 > X14112.1_gene.gff3
      grep "ID=gene" X14112.1.gff3 >> X14112.1_gene.gff
      #!!!!VERY_IMPORTANT!!!!: change type '\tgene\t' to '\texon\t'! 
      #sed -i -e "s/\tgene\t/\texon\t/g" X14112.1_gene_.gff # since default is --featurecounts_feature_type 'exon'
      
      # -- New processing for virus gtf --
      gffread X14112.1_orig.gff -T -o X14112.1_v2.gtf
      
      python3 add_gene_id.py  # X14112.1_v2.gtf --> X14112.1_v3.gtf
      #-- add_gene_id.py --
      def add_missing_gene_id(input_gtf, output_gtf):
          with open(input_gtf, 'r') as in_gtf, open(output_gtf, 'w') as out_gtf:
              for line in in_gtf:
                  line = line.strip()
                  if line and not line.startswith('#'):  # header and blank lines are copied unchanged
                      elements = line.split('\t')
                      attributes = elements[8]
                      if 'gene_id' not in attributes:
                          # Extract the transcript_id attribute (key and quoted value)
                          transcript_id = ''
                          for attr in attributes.split(';'):
                              if 'transcript_id' in attr:
                                  transcript_id = attr.strip()
                          # Prepend a gene_id with the same value as the transcript_id
                          if transcript_id != '':
                              attributes = f'{transcript_id.replace("transcript_id", "gene_id")}; {attributes}'
                      elements[8] = attributes
                      line = '\t'.join(elements)
                  out_gtf.write(line + '\n')
      # Use the function
      input_gtf = 'X14112.1_v2.gtf'  # Path to your input GTF
      output_gtf = 'X14112.1_v3.gtf'  # Path to the output GTF
      add_missing_gene_id(input_gtf, output_gtf)
    • Human herpesvirus 1, also known as Herpes Simplex Virus type 1 (HSV-1), is a virus with a complex genome encoding around 70-80 genes. The number of genes can vary slightly depending on the specific strain of HSV-1, as well as the methodologies used to identify and annotate the genes.

    • IE175, also known as ICP4 (Infected Cell Polypeptide 4), is a protein encoded by the Human herpesvirus 1 (HSV-1). The gene for this protein is also referred to as the IE (immediate early) gene 3, and the protein it encodes is a major regulatory protein.

    • In the lifecycle of HSV-1, immediate early genes are the first set of genes to be transcribed following infection. The proteins produced from these genes then regulate the expression of early and late genes that are involved in viral DNA replication and the production of viral structural proteins.

    • ICP4, in particular, is essential for the onset of viral replication. It acts as a trans-activator, promoting transcription of other viral genes. It can also interact with host cell proteins and influence host gene expression. As a result of these functions, ICP4 plays a key role in the pathogenesis of HSV-1 infection.

    • Please note that the naming convention for viral genes and proteins can sometimes be inconsistent, with multiple names referring to the same gene or protein. IE175, ICP4, and IE gene 3 all refer to the same gene in HSV-1.

      # Delete the records if they are intron or manually add gene_name to the records without gene_name. 
      
      cp X14112.1_v3.gtf X14112.1_v4.gtf
      #Find all records without "gene_name"
      grep -v "gene_name" X14112.1_v4.gtf
      
      #-->Delete intron records: grep "intron" X14112.1.gff3_orig
      DEL X14112.1        EMBL    transcript      4953    6907    .       -       .       transcript_id "id-X14112.1:4953..6907"; gene_id "id-X14112.1:4953..6907"
      DEL X14112.1        EMBL    exon    4953    6907    .       -       .       gene_id "id-X14112.1:4953..6907"; transcript_id "id-X14112.1:4953..6907";
      DEL X14112.1        EMBL    transcript      132374  132539  .       +       .       transcript_id "id-X14112.1:132374..132539"; gene_id "id-X14112.1:132374..132539"
      DEL X14112.1        EMBL    exon    132374  132539  .       +       .       gene_id "id-X14112.1:132374..132539"; transcript_id "id-X14112.1:132374..132539";
      DEL X14112.1        EMBL    transcript      145649  145860  .       -       .       transcript_id "id-X14112.1:145649..145860"; gene_id "id-X14112.1:145649..145860"
      DEL X14112.1        EMBL    exon    145649  145860  .       -       .       gene_id "id-X14112.1:145649..145860"; transcript_id "id-X14112.1:145649..145860";
      
      # or update: grep "146805" X14112.1_orig.gff
      UPDATE X14112.1        EMBL    transcript      146805  151063  .       +       .       transcript_id "rna-X14112.1:146805..151063"; gene_id "rna-X14112.1:146805..151063"
      UPDATE X14112.1        EMBL    exon    146805  151063  .       +       .       gene_id "rna-X14112.1:146805..151063"; transcript_id "rna-X14112.1:146805..151063";
                                                                                        --> transcript_id "rna-IE175"; gene_id "gene-IE175"; gene_name "IE175";
                                                                                        --> transcript_id "rna-IE175"; gene_id "gene-IE175"; gene_name "IE175";
      
      # or update: grep "133941" X14112.1_orig.gff
      UPDATE X14112.1        EMBL    transcript      133941  146107  .       -       .       transcript_id "rna-X14112.1:133941..146107"; gene_id "rna-X14112.1:133941..146107"
      UPDATE X14112.1        EMBL    exon    133941  145648  .       -       .       gene_id "rna-X14112.1:133941..146107"; transcript_id "rna-X14112.1:133941..146107";
      UPDATE X14112.1        EMBL    exon    145861  146107  .       -       .       gene_id "rna-X14112.1:133941..146107"; transcript_id "rna-X14112.1:133941..146107";
                                                                                        --> transcript_id "rna-IE68"; gene_id "rna-IE68"; gene_name "IE68";
                                                                                        --> gene_id "rna-IE68"; transcript_id "rna-IE68"; gene_name "IE68";
                                                                                        --> gene_id "rna-IE68"; transcript_id "rna-IE68"; gene_name "IE68";
    • (optional) Consider giving every exon and CDS record a distinct name, for example exon-RL2-1, exon-RL2-2, cds-RL2-1. This may not be necessary, since the output contains only transcript-type records.
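The manual check for records without gene_name (grep -v "gene_name" above) can also be scripted. The sketch below parses the attribute column of each GTF line and reports the records that lack a given key. The function names are chosen for illustration, and the two example records are taken from the UPDATE entries above (before and after the edit).

```python
import re

# key "value" pairs as they appear in the ninth GTF column
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')

def parse_attributes(field):
    """Turn 'gene_id "x"; transcript_id "y";' into {'gene_id': 'x', 'transcript_id': 'y'}."""
    return dict(ATTR_RE.findall(field))

def records_missing(gtf_lines, key="gene_name"):
    """Return the GTF records (as text) whose attribute column lacks the given key."""
    missing = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        if key not in parse_attributes(cols[8]):
            missing.append(line.rstrip("\n"))
    return missing

example = [
    'X14112.1\tEMBL\texon\t146805\t151063\t.\t+\t.\t'
    'gene_id "rna-X14112.1:146805..151063"; transcript_id "rna-X14112.1:146805..151063";',
    'X14112.1\tEMBL\texon\t146805\t151063\t.\t+\t.\t'
    'transcript_id "rna-IE175"; gene_id "gene-IE175"; gene_name "IE175";',
]
missing = records_missing(example)
print(len(missing))
```

For a real file, the lines would come from `open("X14112.1_v4.gtf")`; an empty result means every record carries a gene_name.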

  2. Run nextflow for virus

    docker pull nfcore/rnaseq
    /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_virus --fasta "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1.fasta" --gtf "/home/jhuang/DATA/Data_Manja_RNAseq_Organoids_Virus/X14112.1_v4.gtf" --with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P .{12}).*" --umitools_dedup_stats --skip_rseqc --skip_dupradar --skip_preseq -profile docker -resume --max_cpus 55 --max_memory 120.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'hisat2' --gtf_extra_attributes 'gene_name' --gtf_group_features 'gene_id' --featurecounts_group_type 'gene_name' --featurecounts_feature_type 'exon' --umitools_grouping_method 'unique'
  3. Run nextflow for human using GRCh38 genome

    docker pull nfcore/rnaseq
    /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results_GRCh38 --genome GRCh38 --with_umi --umitools_extract_method "regex" --umitools_bc_pattern "^(?P .{12}).*" --umitools_dedup_stats --skip_rseqc --skip_dupradar --skip_preseq -profile docker -resume --max_cpus 55 --max_memory 128.GB --max_time 2400.h --save_align_intermeds --save_unaligned --save_reference --aligner 'star_salmon' --pseudo_aligner 'salmon' --gtf_extra_attributes 'gene_name' --gtf_group_features 'gene_id' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon' --umitools_grouping_method 'unique'

3.1. BUG_1 for running d8_r1 due to memory

  # in modules/nf-core/umitools/dedup/main.nf
  process UMITOOLS_DEDUP {
      tag "$meta.id"
      //REMOVED  label "process_medium"
      //ADDED
      label 'high_memory' // this needs to be defined in your config file
      cpus 55 // adjust as per your system's capabilities

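  # Commenting the label out with '#' instead of '//' fails, because Nextflow scripts use Groovy comment syntax:
  ERROR ~ Module compilation error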
  - file : /mnt/h1/jhuang/DATA/Data_Manja_RNAseq_Organoids/rnaseq/./workflows/../subworkflows/nf-core/bam_dedup_stats_samtools_umitools/../../../modules/nf-core/umitools/dedup/main.nf
  - cause: Unexpected character: '#' @ line 3, column 5.
        #label "process_medium"
      ^

3.2. BUG_2 for running d8_r1 due to memory

  # in conf/test_full.config
  process {
    //ADDED
    withLabel: 'high_memory' {
      memory = '120 GB' // adjust as per your system's capabilities
    }
    withName: 'UMITOOLS_DEDUP' {
      time = '160.h' // Adjust the time limit to your needs
    }
  }

  ERROR ~ Error executing process > 'NFCORE_RNASEQ:RNASEQ:BAM_DEDUP_STATS_SAMTOOLS_UMITOOLS_TRANSCRIPTOME:UMITOOLS_DEDUP (control_r2)'
  Caused by:
    Process requirement exceeds available memory -- req: 128 GB; avail: 125.8 GB
  Command executed:
    PYTHONHASHSEED=0 umi_tools \
        dedup \
        -I control_r2.transcriptome.sorted.bam \
        -S control_r2.umi_dedup.transcriptome.sorted.bam \
        --output-stats control_r2.umi_dedup.transcriptome.sorted \
        --method='unique' --random-seed=100
    cat <<-END_VERSIONS > versions.yml
    "NFCORE_RNASEQ:RNASEQ:BAM_DEDUP_STATS_SAMTOOLS_UMITOOLS_TRANSCRIPTOME:UMITOOLS_DEDUP":
        umitools: $(umi_tools --version 2>&1 | sed 's/^.*UMI-tools version://; s/ *$//')
    END_VERSIONS
  Command exit status:
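The error above comes down to a simple comparison: the process requested 128 GB while the host reported 125.8 GB as available. A small sketch of that check, which may help when choosing a memory limit before launching a run. The helper names and the 5 GB headroom are assumptions for illustration and are not part of Nextflow.

```python
import math


def request_fits(requested_gb, available_gb):
    """True if the requested memory does not exceed what the host reports."""
    return requested_gb <= available_gb


def largest_safe_request(available_gb, headroom_gb=5.0):
    """Largest whole-GB request that leaves some headroom (assumed: 5 GB)."""
    return max(0, math.floor(available_gb - headroom_gb))


print(request_fits(128, 125.8))      # the failing request from the log
print(request_fits(120, 125.8))      # the value used in the config above
print(largest_safe_request(125.8))
```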
  1. R-code for evaluation of nextflow outputs

    # Import the required libraries
    library("AnnotationDbi")
    library("clusterProfiler")
    library("ReactomePA")
    library(gplots)
    
    library(tximport)
    library(DESeq2)
    
    setwd("~/DATA/Data_Manja_RNAseq_Organoids/results_GRCh38_unique_9samples/star_salmon")
    
    # Define paths to your Salmon output quantification files
    files <- c("control_r1" = "./control_r1/quant.sf",
              "control_r2" = "./control_r2/quant.sf",
              "HSV.d2_r1" = "./HSV.d2_r1/quant.sf",
              "HSV.d2_r2" = "./HSV.d2_r2/quant.sf",
              "HSV.d4_r1" = "./HSV.d4_r1/quant.sf",
              "HSV.d4_r2" = "./HSV.d4_r2/quant.sf",
              "HSV.d6_r1" = "./HSV.d6_r1/quant.sf",
              "HSV.d6_r2" = "./HSV.d6_r2/quant.sf",
              "HSV.d8_r2" = "./HSV.d8_r2/quant.sf")
    
    # Import the transcript abundance data with tximport
    txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)
    
    # Define the replicates and condition of the samples
    replicate <- factor(c("r1", "r2", "r1", "r2", "r1", "r2", "r1", "r2", "r2"))
    condition <- factor(c("control", "control", "HSV.d2", "HSV.d2", "HSV.d4", "HSV.d4", "HSV.d6", "HSV.d6", "HSV.d8"))
    
    # Define the colData for DESeq2
    colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    
    # Optional pre-filtering. With tximport and DESeq2 this step is not strictly needed, because DESeq2 applies its own independent filtering of low-count genes when results() is called.
    # Keep only genes with a count above 3 in more than 2 samples
    # dds <- dds[rowSums(counts(dds) > 3) > 2, ]
    
    # Run DESeq2
    dds <- DESeq(dds)
    
    # Perform rlog transformation
    rld <- rlogTransformation(dds)
    
    # Output raw count data to a CSV file
    write.csv(counts(dds), file="transcript_counts.csv")
    
    # -- gene-level count data --
    # Read in the tx2gene map from salmon_tx2gene.tsv
    #tx2gene <- read.csv("salmon_tx2gene.tsv", sep="\t", header=FALSE)
    tx2gene <- read.table("salmon_tx2gene.tsv", header=FALSE, stringsAsFactors=FALSE)
    
    # Set the column names
    colnames(tx2gene) <- c("transcript_id", "gene_id", "gene_name")
    
    # Remove the gene_name column if not needed
    tx2gene <- tx2gene[,1:2]
    
    # Import and summarize the Salmon data with tximport
    txi <- tximport(files, type = "salmon", tx2gene = tx2gene, txOut = FALSE)
    
    # Continue with the DESeq2 workflow as before...
    colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
    dds <- DESeqDataSetFromTximport(txi, colData=colData, design=~condition)
    #dds <- dds[rowSums(counts(dds) > 3) > 2, ]    #60605-->26543
    dds <- DESeq(dds)
    rld <- rlogTransformation(dds)
    write.csv(counts(dds, normalized=FALSE), file="gene_counts.csv")
    
    # TODO: why were so many reads removed as "too short"?
    # Shell command (not R), kept as a note on STAR's --outFilterMatchNmin option:
    # STAR --runThreadN 4 --genomeDir /path/to/GenomeDir --readFilesIn /path/to/read1.fastq /path/to/read2.fastq --outFilterMatchNmin 50 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /path/to/output
    
    dim(counts(dds))
    head(counts(dds), 10)  
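The commented-out pre-filter rowSums(counts(dds) > 3) > 2 keeps a gene only if more than two samples have a count above 3. The same logic in plain Python, on a made-up count table (gene names and counts are illustrative only, not taken from the data):

```python
def keep_gene(counts, min_count=3, min_samples=2):
    """Mirror rowSums(counts > min_count) > min_samples for one gene's counts."""
    samples_above = sum(1 for c in counts if c > min_count)
    return samples_above > min_samples


# Toy table: one list of 9 per-sample counts per gene
toy_counts = {
    "geneA": [10, 12, 0, 7, 9, 0, 4, 5, 8],   # 7 samples above 3
    "geneB": [0, 1, 4, 0, 2, 5, 0, 0, 1],     # only 2 samples above 3
    "geneC": [0, 0, 0, 0, 0, 0, 0, 0, 0],
}

kept = [gene for gene, counts in toy_counts.items() if keep_gene(counts)]
print(kept)
```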

WHY: STAR reports too many “Unmapped: too short” reads when run through nextflow. I read the STAR manual: the default values are 0, which would mean that reads are never discarded for being too short. ASSUMPTION: the reads removed by umi_tools dedup are counted as “Unmapped: too short”. Check the intermediate BAM files.

Only part of this holds. According to the STAR manual, --outFilterMatchNmin and --outFilterScoreMin default to 0, but the relative filters --outFilterMatchNminOverLread and --outFilterScoreMinOverLread default to 0.66. A read is therefore reported as “too short” when its best alignment covers less than 66% of the read length (for paired-end reads, of the summed length of both mates).

The “Unmapped: too short” statistic in the STAR output does not mean that the read itself is short. It counts reads whose best alignment covers too small a fraction of the read, in other words reads that did not map well enough to the reference genome. Reads that map to too many locations are reported separately.

If you’re seeing a high proportion of “Unmapped: too short” reads, it could be due to several reasons:

  • The quality of the reads might be poor, leading to low mapping efficiency.
  • The reference genome might not be the correct one for your data.
  • The reads might be very short after trimming, or still carry adapter or UMI sequence, so that too small a fraction of each read aligns.

For troubleshooting, you could:

  • Check the quality of your reads using a tool like FastQC.
  • Ensure you’re using the correct reference genome.
  • Check the length distribution of your reads after trimming. If many reads fail the relative alignment thresholds, consider adjusting --outFilterMatchNminOverLread and --outFilterScoreMinOverLread if appropriate.
  • If you’re working with paired-end data, make sure both reads of a pair pass the quality controls. STAR is sensitive to inconsistencies between paired reads.

Finally, always make sure your read preprocessing steps (like trimming for quality and adapter sequences) are performed correctly as these can greatly affect the downstream analysis.
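STAR's alignment-length filters can be sketched as follows. The default values (--outFilterMatchNminOverLread 0.66, --outFilterScoreMinOverLread 0.66, --outFilterMatchNmin 0, --outFilterScoreMin 0) are those listed in the STAR manual. The function itself is a simplified illustration with hypothetical names, and STAR's exact rounding may differ.

```python
def is_too_short(matched_bases, score, read_length,
                 match_nmin=0, match_nmin_over_lread=0.66,
                 score_min=0, score_min_over_lread=0.66):
    """Return True if the best alignment fails the length/score filters.

    read_length is the full read length; for paired-end data it is the
    summed length of both mates.
    """
    passes = (
        matched_bases >= match_nmin
        and matched_bases >= match_nmin_over_lread * read_length
        and score >= score_min
        and score >= score_min_over_lread * read_length
    )
    return not passes


# 100-base read with only 60 bases aligned: below 66% of the read length
print(is_too_short(matched_bases=60, score=58, read_length=100))
# Paired-end 2 x 100 bases where only one mate aligns
print(is_too_short(matched_bases=100, score=98, read_length=200))
# Same read as the first case, with the relative filters switched off
print(is_too_short(60, 58, 100, match_nmin_over_lread=0, score_min_over_lread=0))
```

The second call shows why a pair in which only one mate aligns ends up in the “too short” category under the default settings.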

Yersinia outer proteins (Yops) analysis

  1. This step uses rsync to download data from the NCBI server to a local directory. All GFF files are then saved in the directory prokka.

    rsync --copy-links --recursive --times --verbose rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/696/305/GCF_001696305.1_UCN72.1 Yersinia_pestis_1045
    
    GCF_001656035.1_ASM165603v1_genomic.fna.gz
    070 status=suppressed
    jhuang@hamburg:~/DATA/Data_Gunnar_Yersiniomics$ cp data/Yersinia_pseudotuberculosis_PB1+/GCF_000020085.1_ASM2008v1/GCF_000020085.1_ASM2008v1_genomic.fna.gz assembly/Yersinia_pseudotuberculosis_PB1+.fna.gz
    cp: cannot stat 'data/Yersinia_pseudotuberculosis_PB1+/GCF_000020085.1_ASM2008v1/GCF_000020085.1_ASM2008v1_genomic.fna.gz': No such file or directory
    088
    jhuang@hamburg:~/DATA/Data_Gunnar_Yersiniomics$ cp data/Yersinia_pseudotuberculosis_YPIII/GCF_000019465.1_ASM1946v1/GCF_000019465.1_ASM1946v1_genomic.fna.gz assembly/Yersinia_pseudotuberculosis_YPIII.fna.gz
    cp: cannot stat 'data/Yersinia_pseudotuberculosis_YPIII/GCF_000019465.1_ASM1946v1/GCF_000019465.1_ASM1946v1_genomic.fna.gz': No such file or directory
    
    #status=latest
    
    for sample in Yersinia_pestis_1045 Yersinia_pestis_SCPM-O-B-6291_C-25 Yersinia_pestis_2944 Yersinia_pestis_KIM10+ Yersinia_pestis_M-1482 Yersinia_pestis_KIM5 Yersinia_pestis_C-781 Yersinia_pestis_14D Yersinia_pestis_KM_567 Yersinia_pestis_M-1770 Yersinia_pestis_C-792 Yersinia_pestis_M2086 Yersinia_pestis_Harbin_35 Yersinia_pestis_Nicholisk_41 Yersinia_pestis_Harbin_35_bis Yersinia_pestis_SCPM-O-B-5935_I-1996 Yersinia_pestis_I-1252 Yersinia_pestis_FDAARGOS_603 Yersinia_pestis_195P Yersinia_pestis_Nepal516 Yersinia_pestis_S19960127 Yersinia_pestis_SCPM-O-B-6530 Yersinia_pestis_C-783 Yersinia_pestis_A1122 Yersinia_pestis_Cadman Yersinia_pestis_A1122_bis Yersinia_pestis_CO92_pgm-_pPCP1- Yersinia_pestis_CO92 Yersinia_pestis_Shasta Yersinia_pestis_Dodson Yersinia_pestis_El_Dorado Yersinia_pestis_EV76-CN Yersinia_pestis_EV_NIIEG Yersinia_pestis_Java9 Yersinia_pestis_PBM19 Yersinia_pestis_20 Yersinia_pestis_D182038 Yersinia_pestis_D106004 Yersinia_pestis_Z176003 Yersinia_pestis_Antiqua_bis Yersinia_pestis_FDAARGOS_601 Yersinia_pestis_Antiqua Yersinia_pestis_Nairobi Yersinia_pestis_M2085 Yersinia_pestis_SCPM-O-B-5942_I-2638 Yersinia_pestis_M2029 Yersinia_pestis_SCPM-O-DNA-18_I-3113 Yersinia_pestis_94 Yersinia_pestis_R Yersinia_pestis_790 Yersinia_pestis_SCPM-O-B-6899_231 Yersinia_pestis_FDAARGOS_602 Yersinia_pestis_Pestoides_B Yersinia_pestis_M-1974 Yersinia_pestis_91001 Yersinia_pestis_Angola Yersinia_pestis_Angola_bis Yersinia_pestis_3770 Yersinia_pestis_1412 Yersinia_pestis_1413 Yersinia_pestis_8787 Yersinia_pestis_3067 Yersinia_pestis_Pestoides_G Yersinia_pestis_Pestoides_F Yersinia_pestis_Pestoides_F_bis Yersinia_pestis_1522 Yersinia_pseudotuberculosis_FDAARGOS_582 Yersinia_pseudotuberculosis_NZYP4713 Yersinia_pseudotuberculosis_NCTC8480  Yersinia_pseudotuberculosis_PB1+_bis Yersinia_pseudotuberculosis_MD67 Yersinia_pseudotuberculosis_NCTC10217 Yersinia_pseudotuberculosis_NCTC10275 Yersinia_pseudotuberculosis_1 Yersinia_pseudotuberculosis_IP32953 
Yersinia_pseudotuberculosis_IP32953_bis Yersinia_pseudotuberculosis_FDAARGOS_583 Yersinia_pseudotuberculosis_FDAARGOS_581 Yersinia_pseudotuberculosis_ATCC_6904 Yersinia_pseudotuberculosis_EP2+ Yersinia_pseudotuberculosis_IP31758 Yersinia_pseudotuberculosis_598 Yersinia_pseudotuberculosis_PA3606 Yersinia_pseudotuberculosis_FDAARGOS_665 Yersinia_pseudotuberculosis_FDAARGOS_584 Yersinia_pseudotuberculosis_YPIII_bis  Yersinia_pseudotuberculosis_FDAARGOS_579 Yersinia_pseudotuberculosis_IP2666pIB1 Yersinia_pseudotuberculosis_FDAARGOS_342 Yersinia_pseudotuberculosis_FDAARGOS_580 Yersinia_pseudotuberculosis_NCTC3571 Yersinia_similis_228 Yersinia_enterocolitica_NCTC13629 Yersinia_enterocolitica_MGYG-HGUT-02335 Yersinia_enterocolitica_Y1 Yersinia_enterocolitica_Y11 Yersinia_enterocolitica_NCTC13769 Yersinia_enterocolitica_FDAARGOS_1082 Yersinia_enterocolitica_2516-87 Yersinia_enterocolitica_KNG22703 Yersinia_enterocolitica_1055Rr Yersinia_enterocolitica_FDAARGOS_1090 Yersinia_enterocolitica_YE1 Yersinia_enterocolitica_YE3 Yersinia_enterocolitica_YE6 Yersinia_enterocolitica_YE7 Yersinia_enterocolitica_YE5 Yersinia_enterocolitica_YE165 Yersinia_enterocolitica_8081 Yersinia_enterocolitica_8081_bis Yersinia_enterocolitica_NCTC12982 Yersinia_enterocolitica_WA Yersinia_enterocolitica_NW57 Yersinia_enterocolitica_NW117 Yersinia_enterocolitica_NW51 Yersinia_enterocolitica_NW56 Yersinia_enterocolitica_NW115 Yersinia_enterocolitica_NW67 Yersinia_enterocolitica_FORC_002 Yersinia_enterocolitica_FORC_002_bis Yersinia_enterocolitica_NW66 Yersinia_enterocolitica_MP98 Yersinia_enterocolitica_Gp259 Yersinia_enterocolitica_FORC066 Yersinia_enterocolitica_Gp2 Yersinia_enterocolitica_str_YE5303 Yersinia_enterocolitica_Gp200 Yersinia_enterocolitica_NW116 Yersinia_enterocolitica_Gp169 Yersinia_enterocolitica_NW1 Yersinia_enterocolitica_FORC065 Yersinia_frederiksenii_Y225 Yersinia_kristensenii_Y231 Yersinia_rochesterensis_ATCC_33639 Yersinia_rochesterensis_ATCC_BAA-2637 
Yersinia_intermedia_SCPM-O-B-9106_C-191 Yersinia_kristensenii_2012N-4030 Yersinia_hibernica_CFS1934 Yersinia_hibernica_LC20 Yersinia_canariae_NCTC_14382 Yersinia_frederiksenii_FDAARGOS_418 Yersinia_alsatica_SCPM-O-B-7604 Yersinia_rohdei_YRA Yersinia_massiliensis_GTA Yersinia_massiliensis_2011N-4075 Yersinia_frederiksenii_FDAARGOS_417 Yersinia_intermedia_SCPM-O-B-8026_C-146 Yersinia_sp_KBS0713 Yersinia_bercovieri_ATCC_43970 Yersinia_aleksiciae_159 Yersinia_mollaretii_ATCC_43969 Yersinia_intermedia_FDAARGOS_729 Yersinia_intermedia_FDAARGOS_730 Yersinia_intermedia_NCTC11469 Yersinia_intermedia_FDAARGOS_358 Yersinia_sp_FDAARGOS_228 Yersinia_intermedia_Y228 Yersinia_intermedia_N6293 Yersinia_intermedia_SCPM-O-B-10209_333 Yersinia_aldovae_670-83 Yersinia_ruckeri_NHV_3758 Yersinia_ruckeri_NVI-10705 Yersinia_ruckeri_NVI-1292 Yersinia_ruckeri_NVI-4570 Yersinia_ruckeri_NVI-6614 Yersinia_ruckeri_NVI-11267 Yersinia_ruckeri_NVI-11294 Yersinia_ruckeri_NVI-10571 Yersinia_ruckeri_NVI-8524 Yersinia_ruckeri_NVI-1176 Yersinia_ruckeri_NVI-701 Yersinia_ruckeri_17Y0412 Yersinia_ruckeri_17Y0414 Yersinia_ruckeri_NVI-492 Yersinia_ruckeri_NVI-9681 Yersinia_ruckeri_SC09 Yersinia_ruckeri_17Y0157 Yersinia_ruckeri_17Y0189 Yersinia_ruckeri_17Y0153 Yersinia_ruckeri_17Y0155 Yersinia_ruckeri_KMM821 Yersinia_ruckeri_16Y0180 Yersinia_ruckeri_NVI-11050 Yersinia_ruckeri_NVI-11076 Yersinia_ruckeri_QMA0440 Yersinia_ruckeri_Big_Creek_74 Yersinia_ruckeri_NVI-5089 Yersinia_ruckeri_NVI-10587 Yersinia_ruckeri_NVI-4840 Yersinia_ruckeri_NVI-4479 Yersinia_ruckeri_17Y0161 Yersinia_ruckeri_17Y0163 Yersinia_ruckeri_NVI-11073 Yersinia_ruckeri_NVI-11065 Yersinia_ruckeri_17Y0159 Yersinia_ruckeri_NVI-8270 Yersinia_ruckeri_YRB Yersinia_entomophaga_MH96; do
    mlst ${sample}.fna >> ../mlst/all.txt;
    done
    
    #gene-M486_RS20950
    #M486_RS20950
    
    #extract CDS with locus_tag from genbank file

        #cut -d' ' -f1 ../assembly/${sample}.fna > ../assembly/${sample}.fasta;
        #cat ${sample}.gff ../assembly/${sample}.fasta > ../prokkaplus/$(echo $sample | cut -d'_' -f3- | tr " " "_").gff;
        #sed -i 's/###/##FASTA/g' ../prokkaplus/$(echo $sample | cut -d'_' -f3- | tr " " "_").gff;

  2. (Important: only with this modification can the gene ID be tracked.) This step processes the GFF files with the gene annotations of all samples in the directory prokka. It modifies each GFF file and saves the result in the directory prokka_plus. The script handles the samples one by one and performs the following steps for each sample:

    * Replace all occurrences of \tCDS\t with _CDS_ in the original GFF file.
    * Extract all lines containing _CDS_ and save them in a new file with the suffix _CDS.gff.
    * Replace all occurrences of ID= with ID_old= in the new _CDS.gff file.
    * Cut the second field (delimited by ;) from the _CDS.gff file and save it in a new file with the suffix _CDS_f2.
    * Replace all occurrences of Parent=gene- with ID= in the _CDS_f2 file.
    * Paste the contents of the _CDS.gff and _CDS_f2 files side by side, with a ; delimiter, and save the result in a new file with the suffix _CDS_.gff.
    * Run the enum.py script on the _CDS_.gff file to append the line number to each line, and save the result in a new file with the suffix _CDS__.gff. The script enum.py:
        import sys
    
        # The input filename is the only required argument
        if len(sys.argv) < 2:
            print("Please provide a filename as an argument.", file=sys.stderr)
            sys.exit(1)
    
        filename = sys.argv[1]
    
        try:
            with open(filename) as f:
                # Append "_<line number>" (1-based) to every line
                for i, line in enumerate(f, start=1):
                    print(f"{line.strip()}_{i}")
        except FileNotFoundError:
            print(f"File {filename} not found.", file=sys.stderr)
            sys.exit(1)
    * Extract all lines from the original GFF file that do not contain _CDS_ and save them in a new file with the suffix _nonCDS.gff.
    * Remove all lines containing ### from the _nonCDS.gff file and save the result in a new file with the suffix _nonCDS_.gff.
    * Concatenate the contents of the _nonCDS_.gff and _CDS__.gff files and save the result in a new file with the suffix _nonCDS_CDS.gff.
    * Replace all occurrences of _CDS_ with \tCDS\t in the _nonCDS_CDS.gff file.
    * Append the string ##FASTA to the end of the _nonCDS_CDS.gff file.
    * Simplify the FASTA file associated with the sample by keeping only the first space-delimited field of each line, so that each header is reduced to the sequence ID.
    * Concatenate the modified GFF file (_nonCDS_CDS.gff) and the modified FASTA file, and save the result in the ../prokka_plus/ directory with a new name based on the sample name.
    * After processing all samples, the script removes intermediate files generated during the process.
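The GFF part of the steps above can be sketched in Python as a single function. This is an illustration of the same transformation under simplifying assumptions, not the script that was actually run: rewrite_gff is a hypothetical name, the example lines are made up, and a CDS line without a second ';' field simply gets an empty ID prefix here.

```python
def rewrite_gff(lines):
    """Return the modified GFF lines: non-CDS lines first, then renamed CDS lines."""
    cds_lines, other_lines = [], []
    for line in lines:
        line = line.rstrip("\n")
        if "\tCDS\t" in line:
            cds_lines.append(line)
        elif "###" not in line:          # drop the '###' separator lines
            other_lines.append(line)

    result = list(other_lines)
    for number, line in enumerate(cds_lines, start=1):
        renamed = line.replace("ID=", "ID_old=")        # keep the old ID
        fields = renamed.split(";")
        second = fields[1] if len(fields) > 1 else ""   # e.g. Parent=gene-XXX
        new_id = second.replace("Parent=gene-", "ID=")
        # Append the new ID plus the CDS line number to make it unique
        result.append(f"{renamed};{new_id}_{number}")

    result.append("##FASTA")
    return result


example = [
    "##gff-version 3",
    "NC_1\tRefSeq\tgene\t1\t90\t.\t+\t.\tID=gene-M486_RS20950;Name=yopM",
    "NC_1\tRefSeq\tCDS\t1\t90\t.\t+\t0\tID=cds-WP_1;Parent=gene-M486_RS20950;product=YopM",
    "###",
]
for out_line in rewrite_gff(example):
    print(out_line)
```

The CDS line ends up with ID=M486_RS20950_1, which is the form of gene ID that later shows up in the Roary output.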
    
    # ERROR: Input file contains duplicate gene IDs, attempting to fix by adding a unique suffix, new GFF in the fixed_input_files directory: /mnt/Samsung_T5/Data_Gunnar_Yersiniomics/prokka_plus/1045.gff
    # To fix the error above, process the data as follows.
    
    for sample in Yersinia_pestis_1045 Yersinia_pestis_SCPM-O-B-6291_C-25 Yersinia_pestis_2944 Yersinia_pestis_KIM10+ Yersinia_pestis_M-1482 Yersinia_pestis_KIM5 Yersinia_pestis_C-781 Yersinia_pestis_14D Yersinia_pestis_KM_567 Yersinia_pestis_M-1770 Yersinia_pestis_C-792 Yersinia_pestis_M2086 Yersinia_pestis_Harbin_35 Yersinia_pestis_Nicholisk_41 Yersinia_pestis_Harbin_35_bis Yersinia_pestis_SCPM-O-B-5935_I-1996 Yersinia_pestis_I-1252 Yersinia_pestis_FDAARGOS_603 Yersinia_pestis_195P Yersinia_pestis_Nepal516 Yersinia_pestis_S19960127 Yersinia_pestis_SCPM-O-B-6530 Yersinia_pestis_C-783 Yersinia_pestis_A1122 Yersinia_pestis_Cadman Yersinia_pestis_A1122_bis Yersinia_pestis_CO92_pgm-_pPCP1- Yersinia_pestis_CO92 Yersinia_pestis_Shasta Yersinia_pestis_Dodson Yersinia_pestis_El_Dorado Yersinia_pestis_EV76-CN Yersinia_pestis_EV_NIIEG Yersinia_pestis_Java9 Yersinia_pestis_PBM19 Yersinia_pestis_20 Yersinia_pestis_D182038 Yersinia_pestis_D106004 Yersinia_pestis_Z176003 Yersinia_pestis_Antiqua_bis Yersinia_pestis_FDAARGOS_601 Yersinia_pestis_Antiqua Yersinia_pestis_Nairobi Yersinia_pestis_M2085 Yersinia_pestis_SCPM-O-B-5942_I-2638 Yersinia_pestis_M2029 Yersinia_pestis_SCPM-O-DNA-18_I-3113 Yersinia_pestis_94 Yersinia_pestis_R Yersinia_pestis_790 Yersinia_pestis_SCPM-O-B-6899_231 Yersinia_pestis_FDAARGOS_602 Yersinia_pestis_Pestoides_B Yersinia_pestis_M-1974 Yersinia_pestis_91001 Yersinia_pestis_Angola Yersinia_pestis_Angola_bis Yersinia_pestis_3770 Yersinia_pestis_1412 Yersinia_pestis_1413 Yersinia_pestis_8787 Yersinia_pestis_3067 Yersinia_pestis_Pestoides_G Yersinia_pestis_Pestoides_F Yersinia_pestis_Pestoides_F_bis Yersinia_pestis_1522 Yersinia_pseudotuberculosis_FDAARGOS_582 Yersinia_pseudotuberculosis_NZYP4713 Yersinia_pseudotuberculosis_NCTC8480  Yersinia_pseudotuberculosis_PB1+_bis Yersinia_pseudotuberculosis_MD67 Yersinia_pseudotuberculosis_NCTC10217 Yersinia_pseudotuberculosis_NCTC10275 Yersinia_pseudotuberculosis_1 Yersinia_pseudotuberculosis_IP32953 
Yersinia_pseudotuberculosis_IP32953_bis Yersinia_pseudotuberculosis_FDAARGOS_583 Yersinia_pseudotuberculosis_FDAARGOS_581 Yersinia_pseudotuberculosis_ATCC_6904 Yersinia_pseudotuberculosis_EP2+ Yersinia_pseudotuberculosis_IP31758 Yersinia_pseudotuberculosis_598 Yersinia_pseudotuberculosis_PA3606 Yersinia_pseudotuberculosis_FDAARGOS_665 Yersinia_pseudotuberculosis_FDAARGOS_584 Yersinia_pseudotuberculosis_YPIII_bis  Yersinia_pseudotuberculosis_FDAARGOS_579 Yersinia_pseudotuberculosis_IP2666pIB1 Yersinia_pseudotuberculosis_FDAARGOS_342 Yersinia_pseudotuberculosis_FDAARGOS_580 Yersinia_pseudotuberculosis_NCTC3571 Yersinia_similis_228 Yersinia_enterocolitica_NCTC13629 Yersinia_enterocolitica_MGYG-HGUT-02335 Yersinia_enterocolitica_Y1 Yersinia_enterocolitica_Y11 Yersinia_enterocolitica_NCTC13769 Yersinia_enterocolitica_FDAARGOS_1082 Yersinia_enterocolitica_2516-87 Yersinia_enterocolitica_KNG22703 Yersinia_enterocolitica_1055Rr Yersinia_enterocolitica_FDAARGOS_1090 Yersinia_enterocolitica_YE1 Yersinia_enterocolitica_YE3 Yersinia_enterocolitica_YE6 Yersinia_enterocolitica_YE7 Yersinia_enterocolitica_YE5 Yersinia_enterocolitica_YE165 Yersinia_enterocolitica_8081 Yersinia_enterocolitica_8081_bis Yersinia_enterocolitica_NCTC12982 Yersinia_enterocolitica_WA Yersinia_enterocolitica_NW57 Yersinia_enterocolitica_NW117 Yersinia_enterocolitica_NW51 Yersinia_enterocolitica_NW56 Yersinia_enterocolitica_NW115 Yersinia_enterocolitica_NW67 Yersinia_enterocolitica_FORC_002 Yersinia_enterocolitica_FORC_002_bis Yersinia_enterocolitica_NW66 Yersinia_enterocolitica_MP98 Yersinia_enterocolitica_Gp259 Yersinia_enterocolitica_FORC066 Yersinia_enterocolitica_Gp2 Yersinia_enterocolitica_str_YE5303 Yersinia_enterocolitica_Gp200 Yersinia_enterocolitica_NW116 Yersinia_enterocolitica_Gp169 Yersinia_enterocolitica_NW1 Yersinia_enterocolitica_FORC065 Yersinia_frederiksenii_Y225 Yersinia_kristensenii_Y231 Yersinia_rochesterensis_ATCC_33639 Yersinia_rochesterensis_ATCC_BAA-2637 
Yersinia_intermedia_SCPM-O-B-9106_C-191 Yersinia_kristensenii_2012N-4030 Yersinia_hibernica_CFS1934 Yersinia_hibernica_LC20 Yersinia_canariae_NCTC_14382 Yersinia_frederiksenii_FDAARGOS_418 Yersinia_alsatica_SCPM-O-B-7604 Yersinia_rohdei_YRA Yersinia_massiliensis_GTA Yersinia_massiliensis_2011N-4075 Yersinia_frederiksenii_FDAARGOS_417 Yersinia_intermedia_SCPM-O-B-8026_C-146 Yersinia_sp_KBS0713 Yersinia_bercovieri_ATCC_43970 Yersinia_aleksiciae_159 Yersinia_mollaretii_ATCC_43969 Yersinia_intermedia_FDAARGOS_729 Yersinia_intermedia_FDAARGOS_730 Yersinia_intermedia_NCTC11469 Yersinia_intermedia_FDAARGOS_358 Yersinia_sp_FDAARGOS_228 Yersinia_intermedia_Y228 Yersinia_intermedia_N6293 Yersinia_intermedia_SCPM-O-B-10209_333 Yersinia_aldovae_670-83 Yersinia_ruckeri_NHV_3758 Yersinia_ruckeri_NVI-10705 Yersinia_ruckeri_NVI-1292 Yersinia_ruckeri_NVI-4570 Yersinia_ruckeri_NVI-6614 Yersinia_ruckeri_NVI-11267 Yersinia_ruckeri_NVI-11294 Yersinia_ruckeri_NVI-10571 Yersinia_ruckeri_NVI-8524 Yersinia_ruckeri_NVI-1176 Yersinia_ruckeri_NVI-701 Yersinia_ruckeri_17Y0412 Yersinia_ruckeri_17Y0414 Yersinia_ruckeri_NVI-492 Yersinia_ruckeri_NVI-9681 Yersinia_ruckeri_SC09 Yersinia_ruckeri_17Y0157 Yersinia_ruckeri_17Y0189 Yersinia_ruckeri_17Y0153 Yersinia_ruckeri_17Y0155 Yersinia_ruckeri_KMM821 Yersinia_ruckeri_16Y0180 Yersinia_ruckeri_NVI-11050 Yersinia_ruckeri_NVI-11076 Yersinia_ruckeri_QMA0440 Yersinia_ruckeri_Big_Creek_74 Yersinia_ruckeri_NVI-5089 Yersinia_ruckeri_NVI-10587 Yersinia_ruckeri_NVI-4840 Yersinia_ruckeri_NVI-4479 Yersinia_ruckeri_17Y0161 Yersinia_ruckeri_17Y0163 Yersinia_ruckeri_NVI-11073 Yersinia_ruckeri_NVI-11065 Yersinia_ruckeri_17Y0159 Yersinia_ruckeri_NVI-8270 Yersinia_ruckeri_YRB Yersinia_entomophaga_MH96; do
        # Test run on a subset: for sample in Yersinia_pestis_1045 Yersinia_pestis_SCPM-O-B-6291_C-25 Yersinia_pestis_2944 Yersinia_pestis_KIM10+ Yersinia_pestis_M-1482; do
          sed -i 's/\tCDS\t/_CDS_/g' ${sample}.gff
          grep "_CDS_" ${sample}.gff > ${sample}_CDS.gff
          sed -i 's/ID=/ID_old=/g' ${sample}_CDS.gff
          cut -d';' -f2 ${sample}_CDS.gff > ${sample}_CDS_f2
          sed -i 's/Parent=gene-/ID=/g' ${sample}_CDS_f2
          paste -d';' ${sample}_CDS.gff ${sample}_CDS_f2 > ${sample}_CDS_.gff
          python enum.py ${sample}_CDS_.gff > ${sample}_CDS__.gff   # append the line number to each line so that the gene IDs are unique
    
          grep -v "_CDS_" ${sample}.gff > ${sample}_nonCDS.gff
          grep -v "###" ${sample}_nonCDS.gff > ${sample}_nonCDS_.gff
    
          cat ${sample}_nonCDS_.gff ${sample}_CDS__.gff > ${sample}_nonCDS_CDS.gff
          sed -i 's/_CDS_/\tCDS\t/g' ${sample}_nonCDS_CDS.gff
          echo "##FASTA" >> ${sample}_nonCDS_CDS.gff
    
          cut -d' ' -f1 ../assembly/${sample}.fna > ../assembly/${sample}.fasta;
          cat ${sample}_nonCDS_CDS.gff ../assembly/${sample}.fasta > ../prokka_plus/$(echo $sample | cut -d'_' -f3- | tr " " "_").gff;
        done
    
        rm *_CDS.gff *_CDS_f2 *_CDS_.gff *_CDS__.gff *_nonCDS.gff *_nonCDS_.gff
    
    #for sample in Yersinia_pestis_1045 Yersinia_pestis_SCPM-O-B-6291_C-25 ...; do
    #echo $sample | cut -d'_' -f3- | tr " " "_" >> temp
    #done
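The output name is derived with cut -d'_' -f3-, which drops the first two underscore-separated fields (genus and species) and keeps the rest. A small Python equivalent, with short_name as a hypothetical helper name:

```python
def short_name(sample):
    """Drop the first two '_'-separated fields, like cut -d'_' -f3-."""
    return "_".join(sample.split("_")[2:])


for sample in ["Yersinia_pestis_1045",
               "Yersinia_pestis_SCPM-O-B-6291_C-25",
               "Yersinia_enterocolitica_str_YE5303"]:
    print(sample, "->", short_name(sample))
```

Strain names that themselves contain underscores are kept intact, because only the first two fields are removed.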
  3. After the standard run of the bacto-pipeline, run Roary, a tool for pan-genome analysis. It takes annotated bacterial genomes in GFF3 format as input and clusters the genes based on sequence similarity.

    roary -p 4 -f ./roary -i 95 -cd 99 -s -e -n -v  prokka_plus/1045.gff prokka_plus/SCPM-O-B-6291_C-25.gff prokka_plus/2944.gff prokka_plus/KIM10+.gff
    
    roary -p 4 -f ./roary -i 50 -cd 99 -s -e -n -v  prokka_plus/1045.gff prokka_plus/SCPM-O-B-6291_C-25.gff prokka_plus/2944.gff prokka_plus/KIM10+.gff prokka_plus/M-1482.gff prokka_plus/KIM5.gff prokka_plus/C-781.gff prokka_plus/14D.gff prokka_plus/KM_567.gff prokka_plus/M-1770.gff prokka_plus/C-792.gff prokka_plus/M2086.gff prokka_plus/Harbin_35.gff prokka_plus/Nicholisk_41.gff prokka_plus/Harbin_35_bis.gff prokka_plus/SCPM-O-B-5935_I-1996.gff prokka_plus/I-1252.gff prokka_plus/FDAARGOS_603.gff prokka_plus/195P.gff prokka_plus/Nepal516.gff prokka_plus/S19960127.gff prokka_plus/SCPM-O-B-6530.gff prokka_plus/C-783.gff prokka_plus/A1122.gff prokka_plus/Cadman.gff prokka_plus/A1122_bis.gff prokka_plus/CO92_pgm-_pPCP1-.gff prokka_plus/CO92.gff prokka_plus/Shasta.gff prokka_plus/Dodson.gff prokka_plus/El_Dorado.gff prokka_plus/EV76-CN.gff prokka_plus/EV_NIIEG.gff prokka_plus/Java9.gff prokka_plus/PBM19.gff prokka_plus/20.gff prokka_plus/D182038.gff prokka_plus/D106004.gff prokka_plus/Z176003.gff prokka_plus/Antiqua_bis.gff prokka_plus/FDAARGOS_601.gff prokka_plus/Antiqua.gff prokka_plus/Nairobi.gff prokka_plus/M2085.gff prokka_plus/SCPM-O-B-5942_I-2638.gff prokka_plus/M2029.gff prokka_plus/SCPM-O-DNA-18_I-3113.gff prokka_plus/94.gff prokka_plus/R.gff prokka_plus/790.gff prokka_plus/SCPM-O-B-6899_231.gff prokka_plus/FDAARGOS_602.gff prokka_plus/Pestoides_B.gff prokka_plus/M-1974.gff prokka_plus/91001.gff prokka_plus/Angola.gff prokka_plus/Angola_bis.gff prokka_plus/3770.gff prokka_plus/1412.gff prokka_plus/1413.gff prokka_plus/8787.gff prokka_plus/3067.gff prokka_plus/Pestoides_G.gff prokka_plus/Pestoides_F.gff prokka_plus/Pestoides_F_bis.gff prokka_plus/1522.gff prokka_plus/FDAARGOS_582.gff prokka_plus/NZYP4713.gff prokka_plus/NCTC8480.gff prokka_plus/PB1+_bis.gff prokka_plus/MD67.gff prokka_plus/NCTC10217.gff prokka_plus/NCTC10275.gff prokka_plus/1.gff prokka_plus/IP32953.gff prokka_plus/IP32953_bis.gff prokka_plus/FDAARGOS_583.gff prokka_plus/FDAARGOS_581.gff 
prokka_plus/ATCC_6904.gff prokka_plus/EP2+.gff prokka_plus/IP31758.gff prokka_plus/598.gff prokka_plus/PA3606.gff prokka_plus/FDAARGOS_665.gff prokka_plus/FDAARGOS_584.gff prokka_plus/YPIII_bis.gff prokka_plus/FDAARGOS_579.gff prokka_plus/IP2666pIB1.gff prokka_plus/FDAARGOS_342.gff prokka_plus/FDAARGOS_580.gff prokka_plus/NCTC3571.gff prokka_plus/228.gff prokka_plus/NCTC13629.gff prokka_plus/MGYG-HGUT-02335.gff prokka_plus/Y1.gff prokka_plus/Y11.gff prokka_plus/NCTC13769.gff prokka_plus/FDAARGOS_1082.gff prokka_plus/2516-87.gff prokka_plus/KNG22703.gff prokka_plus/1055Rr.gff prokka_plus/FDAARGOS_1090.gff prokka_plus/YE1.gff prokka_plus/YE3.gff prokka_plus/YE6.gff prokka_plus/YE7.gff prokka_plus/YE5.gff prokka_plus/YE165.gff prokka_plus/8081.gff prokka_plus/8081_bis.gff prokka_plus/NCTC12982.gff prokka_plus/WA.gff prokka_plus/NW57.gff prokka_plus/NW117.gff prokka_plus/NW51.gff prokka_plus/NW56.gff prokka_plus/NW115.gff prokka_plus/NW67.gff prokka_plus/FORC_002.gff prokka_plus/FORC_002_bis.gff prokka_plus/NW66.gff prokka_plus/MP98.gff prokka_plus/Gp259.gff prokka_plus/FORC066.gff prokka_plus/Gp2.gff prokka_plus/str_YE5303.gff prokka_plus/Gp200.gff prokka_plus/NW116.gff prokka_plus/Gp169.gff prokka_plus/NW1.gff prokka_plus/FORC065.gff prokka_plus/Y225.gff prokka_plus/Y231.gff prokka_plus/ATCC_33639.gff prokka_plus/ATCC_BAA-2637.gff prokka_plus/SCPM-O-B-9106_C-191.gff prokka_plus/2012N-4030.gff prokka_plus/CFS1934.gff prokka_plus/LC20.gff prokka_plus/NCTC_14382.gff prokka_plus/FDAARGOS_418.gff prokka_plus/SCPM-O-B-7604.gff prokka_plus/YRA.gff prokka_plus/GTA.gff prokka_plus/2011N-4075.gff prokka_plus/FDAARGOS_417.gff prokka_plus/SCPM-O-B-8026_C-146.gff prokka_plus/KBS0713.gff prokka_plus/ATCC_43970.gff prokka_plus/159.gff prokka_plus/ATCC_43969.gff prokka_plus/FDAARGOS_729.gff prokka_plus/FDAARGOS_730.gff prokka_plus/NCTC11469.gff prokka_plus/FDAARGOS_358.gff prokka_plus/FDAARGOS_228.gff prokka_plus/Y228.gff prokka_plus/N6293.gff prokka_plus/SCPM-O-B-10209_333.gff 
prokka_plus/670-83.gff prokka_plus/NHV_3758.gff prokka_plus/NVI-10705.gff prokka_plus/NVI-1292.gff prokka_plus/NVI-4570.gff prokka_plus/NVI-6614.gff prokka_plus/NVI-11267.gff prokka_plus/NVI-11294.gff prokka_plus/NVI-10571.gff prokka_plus/NVI-8524.gff prokka_plus/NVI-1176.gff prokka_plus/NVI-701.gff prokka_plus/17Y0412.gff prokka_plus/17Y0414.gff prokka_plus/NVI-492.gff prokka_plus/NVI-9681.gff prokka_plus/SC09.gff prokka_plus/17Y0157.gff prokka_plus/17Y0189.gff prokka_plus/17Y0153.gff prokka_plus/17Y0155.gff prokka_plus/KMM821.gff prokka_plus/16Y0180.gff prokka_plus/NVI-11050.gff prokka_plus/NVI-11076.gff prokka_plus/QMA0440.gff prokka_plus/Big_Creek_74.gff prokka_plus/NVI-5089.gff prokka_plus/NVI-10587.gff prokka_plus/NVI-4840.gff prokka_plus/NVI-4479.gff prokka_plus/17Y0161.gff prokka_plus/17Y0163.gff prokka_plus/NVI-11073.gff prokka_plus/NVI-11065.gff prokka_plus/17Y0159.gff prokka_plus/NVI-8270.gff prokka_plus/YRB.gff prokka_plus/MH96.gff
    
    #DEL makeblastdb -in fna -dbtype 'nucl' -out fna.db 
    #DELblastn -db fna.db -query yopK.fasta -out yopK_on_fna.blastn -evalue 10000  -num_threads 15 -outfmt 6 
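Typing out some 200 GFF paths by hand is error-prone. The sketch below assembles the same Roary command line from a list of strain names, using only the options that appear in the commands above. build_roary_command is a hypothetical helper; it only builds the argument list, and running it (for example with subprocess.run) is left to the caller.

```python
def build_roary_command(strain_names, identity=95, threads=4,
                        gff_dir="prokka_plus", out_dir="./roary"):
    """Build the roary argument list for the given strain names."""
    gff_files = [f"{gff_dir}/{name}.gff" for name in strain_names]
    return ["roary", "-p", str(threads), "-f", out_dir,
            "-i", str(identity), "-cd", "99", "-s", "-e", "-n", "-v",
            *gff_files]


command = build_roary_command(["1045", "SCPM-O-B-6291_C-25", "2944", "KIM10+"])
print(" ".join(command))
```

Passing identity=50 reproduces the second, more permissive call.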
  4. Generate yop*_seq.txt from the Roary output: this step extracts the coding sequences (CDS) of specific genes from multiple genome files and saves them to an output file. Input files: roary/pan_genome_reference.fa and roary/gene_presence_absence.csv. Example for yopM:

    grep "yopM" roary/gene_presence_absence.csv
    #6+19+45=70 --> 71
    "yopM","","type III secretion system effector YopM","45","45","1","","","","","","1229","1229","1229","","M486_RS20920_3990","","M479_RS01070_4055","M480_RS01170_4076","","M481_RS01115_4071","","","","","","","","","","","","","LDH65_RS21345_4177","","","","","M478_RS01000_4055","M482_RS01070_4063","M483_RS00915_4013","","","M477_RS21610_4128","","","M484_RS01125_4011","","LDH63_RS21760_4259","","","","","","","","","","YPA_RS22550_4200","CH58_RS00945_4248","","","","","","YPO_RS00170_4130","AK38_RS00930_4114","BAY22_RS21640_4174","YPD4_RS21505_4104","YPD8_RS21525_4060","CH61_RS00195_4143","BZ20_RS00435_4174","M0M60_RS21870_4286","","CH46_RS00070_4122","","","","","","","","EGX53_RS00030_4033","EGX52_RS00260_4348","","","","","EGX46_RS00245_4205","","EGX74_RS00040_4070","","","","","","","","","","","","","YPC_RS21075_4024","CH55_RS00770_4123","","DN756_RS21785_4075","","","","CH62_RS00690_4176","","","CH44_RS00795_4078","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","CH63_RS00700_4106","","","CH59_RS00970_4231","","YPDSF_RS21140_4036","BZ18_RS00325_4042","CH43_RS00040_3994","","LDH64_RS21810_4270","S96127_RS00100_4096","","","GCK71_RS22420_4113","GD372_RS22475_4112","","DJY80_RS22415_4098","GCK69_RS22480_4113","","","","GCK70_RS22160_4053","BZ15_RS00325_4183","","","","","","","","","","","","","","","","YPZ3_RS21220_4056",""
    "group_5673","yopM","type III secretion system effector YopM","19","19","1","","","","","","1103","1103","1103","","","YE105_RS20595_4018","","","","","","","","","","","","","","","","","","","","","CH48_RS00390_4060","","","","YP598_RS21115_4110","","","YE_RS21175_4135","CH49_RS00235_4177","","YP_RS21285_4111","","","","","","","","","YPANGOLA_RS22070_4036","CH56_RS22160_4084","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","XM56_RS20545_4037","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","BZ19_RS21445_4113","","","CH60_RS01070_4100","","","","","","","","","","","","","","","","","","","","YEY1_RS21430_4040","Y11_RS21100_4128","","","","BFS78_RS21580_4258","BB936_RS22285_4398","BED35_RS00500_4353","BED32_RS00030_4182","BED33_RS21910_4325","BED34_RS22270_4407","","","","",""
    "group_23005","yopM","type III secretion system effector YopM","6","6","1","","","","","","1589","1589","1589","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","EGX47_RS00105_4453","EGX44_RS00020_4153","EGX39_RS00330_3982","","","","","","","","","","","","","","","","","","","","","","YPTB_RS21675_4159","BZ17_RS00175_4115","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","BN7064_RS22100_4159","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",""
    
    # Start from an empty output file
    : > yopM_seq.txt
    for gene_id in M486_RS20920_3990 YE105_RS20595_4018 M479_RS01070_4055 M480_RS01170_4076  M481_RS01115_4071             LDH65_RS21345_4177    CH48_RS00390_4060 M478_RS01000_4055 M482_RS01070_4063 M483_RS00915_4013 YP598_RS21115_4110  M477_RS21610_4128 YE_RS21175_4135 CH49_RS00235_4177 M484_RS01125_4011 YP_RS21285_4111 LDH63_RS21760_4259        YPANGOLA_RS22070_4036 CH56_RS22160_4084 YPA_RS22550_4200 CH58_RS00945_4248      YPO_RS00170_4130 AK38_RS00930_4114 BAY22_RS21640_4174 YPD4_RS21505_4104 YPD8_RS21525_4060 CH61_RS00195_4143 BZ20_RS00435_4174 M0M60_RS21870_4286  CH46_RS00070_4122        EGX53_RS00030_4033 EGX52_RS00260_4348 EGX47_RS00105_4453 EGX44_RS00020_4153 EGX39_RS00330_3982  EGX46_RS00245_4205  EGX74_RS00040_4070             YPC_RS21075_4024 CH55_RS00770_4123  DN756_RS21785_4075  YPTB_RS21675_4159 BZ17_RS00175_4115 CH62_RS00690_4176   CH44_RS00795_4078   XM56_RS20545_4037                                                     BN7064_RS22100_4159   CH63_RS00700_4106 BZ19_RS21445_4113  CH59_RS00970_4231 CH60_RS01070_4100 YPDSF_RS21140_4036 BZ18_RS00325_4042 CH43_RS00040_3994  LDH64_RS21810_4270 S96127_RS00100_4096   GCK71_RS22420_4113 GD372_RS22475_4112  DJY80_RS22415_4098 GCK69_RS22480_4113    GCK70_RS22160_4053 BZ15_RS00325_4183 CH47_RS00140_4080 YEY1_RS21430_4040 Y11_RS21100_4128    BFS78_RS21580_4258 BB936_RS22285_4398 BED35_RS00500_4353 BED32_RS00030_4182 BED33_RS21910_4325 BED34_RS22270_4407    YPZ3_RS21220_4056; do
    for gbff in  Yersinia_massiliensis_2011N-4075/GCF_013282765.1_ASM1328276v1/GCF_013282765.1_ASM1328276v1_genomic.gbff.gz Yersinia_pestis_EV_NIIEG/GCF_000590535.2_ASM59053v2/GCF_000590535.2_ASM59053v2_genomic.gbff.gz Yersinia_pestis_Shasta/GCF_000834335.1_ASM83433v1/GCF_000834335.1_ASM83433v1_genomic.gbff.gz Yersinia_ruckeri_NVI-492/GCF_023212565.2_ASM2321256v2/GCF_023212565.2_ASM2321256v2_genomic.gbff.gz Yersinia_pestis_Pestoides_G/GCF_000834985.1_ASM83498v1/GCF_000834985.1_ASM83498v1_genomic.gbff.gz Yersinia_pestis_Antiqua_bis/GCF_000834825.1_ASM83482v1/GCF_000834825.1_ASM83482v1_genomic.gbff.gz Yersinia_pestis_91001/GCF_000007885.1_ASM788v1/GCF_000007885.1_ASM788v1_genomic.gbff.gz Yersinia_intermedia_Y228/GCF_000834515.1_ASM83451v1/GCF_000834515.1_ASM83451v1_genomic.gbff.gz Yersinia_pestis_Java9/GCF_000834905.1_ASM83490v1/GCF_000834905.1_ASM83490v1_genomic.gbff.gz Yersinia_pseudotuberculosis_IP32953_bis/GCF_000834295.1_ASM83429v1/GCF_000834295.1_ASM83429v1_genomic.gbff.gz Yersinia_pseudotuberculosis_YPIII_bis/GCF_000834375.1_ASM83437v1/GCF_000834375.1_ASM83437v1_genomic.gbff.gz Yersinia_enterocolitica_8081_bis/GCF_000834795.1_ASM83479v1/GCF_000834795.1_ASM83479v1_genomic.gbff.gz Yersinia_sp_FDAARGOS_228/GCF_002073315.2_ASM207331v2/GCF_002073315.2_ASM207331v2_genomic.gbff.gz Yersinia_enterocolitica_Gp169/GCF_025758435.1_ASM2575843v1/GCF_025758435.1_ASM2575843v1_genomic.gbff.gz Yersinia_pestis_195P/GCF_002005285.1_ASM200528v1/GCF_002005285.1_ASM200528v1_genomic.gbff.gz Yersinia_frederiksenii_FDAARGOS_418/GCF_002591195.1_ASM259119v1/GCF_002591195.1_ASM259119v1_genomic.gbff.gz Yersinia_pseudotuberculosis_NCTC3571/GCF_900636705.1_43908_A02/GCF_900636705.1_43908_A02_genomic.gbff.gz Yersinia_enterocolitica_FORC_002/GCF_000987925.1_ASM98792v1/GCF_000987925.1_ASM98792v1_genomic.gbff.gz Yersinia_ruckeri_NVI-1292/GCF_026435275.1_ASM2643527v1/GCF_026435275.1_ASM2643527v1_genomic.gbff.gz 
Yersinia_pestis_3067/GCF_001188795.1_ASM118879v1/GCF_001188795.1_ASM118879v1_genomic.gbff.gz Yersinia_pestis_M2086/GCF_015336695.1_ASM1533669v1/GCF_015336695.1_ASM1533669v1_genomic.gbff.gz Yersinia_ruckeri_16Y0180/GCF_021399215.1_ASM2139921v1/GCF_021399215.1_ASM2139921v1_genomic.gbff.gz Yersinia_pestis_2944/GCF_001188815.1_ASM118881v1/GCF_001188815.1_ASM118881v1_genomic.gbff.gz Yersinia_rochesterensis_ATCC_BAA-2637/GCF_003600645.1_ASM360064v1/GCF_003600645.1_ASM360064v1_genomic.gbff.gz Yersinia_pestis_Z176003/GCF_000022845.1_ASM2284v1/GCF_000022845.1_ASM2284v1_genomic.gbff.gz Yersinia_intermedia_SCPM-O-B-8026_C-146/GCF_026183385.1_ASM2618338v1/GCF_026183385.1_ASM2618338v1_genomic.gbff.gz Yersinia_enterocolitica_YE5/GCF_001708615.1_ASM170861v1/GCF_001708615.1_ASM170861v1_genomic.gbff.gz Yersinia_enterocolitica_YE6/GCF_001708595.1_ASM170859v1/GCF_001708595.1_ASM170859v1_genomic.gbff.gz Yersinia_pestis_CO92_pgm-_pPCP1-/GCF_001293415.1_ASM129341v1/GCF_001293415.1_ASM129341v1_genomic.gbff.gz Yersinia_pestis_1412/GCF_001188695.1_ASM118869v1/GCF_001188695.1_ASM118869v1_genomic.gbff.gz Yersinia_pestis_El_Dorado/GCF_000834495.1_ASM83449v1/GCF_000834495.1_ASM83449v1_genomic.gbff.gz Yersinia_enterocolitica_KNG22703/GCF_001305635.1_ASM130563v1/GCF_001305635.1_ASM130563v1_genomic.gbff.gz Yersinia_pestis_M-1770/GCF_015337825.2_ASM1533782v2/GCF_015337825.2_ASM1533782v2_genomic.gbff.gz Yersinia_enterocolitica_MP98/GCF_025758515.1_ASM2575851v1/GCF_025758515.1_ASM2575851v1_genomic.gbff.gz Yersinia_enterocolitica_NCTC13629/GCF_900635745.1_32868_F02/GCF_900635745.1_32868_F02_genomic.gbff.gz Yersinia_pestis_94/GCF_024498395.1_ASM2449839v1/GCF_024498395.1_ASM2449839v1_genomic.gbff.gz Yersinia_kristensenii_Y231/GCF_000834865.1_ASM83486v1/GCF_000834865.1_ASM83486v1_genomic.gbff.gz Yersinia_pestis_C-783/GCF_015337285.1_ASM1533728v1/GCF_015337285.1_ASM1533728v1_genomic.gbff.gz Yersinia_pseudotuberculosis_NCTC8480/GCF_900635715.1_32473_H02/GCF_900635715.1_32473_H02_genomic.gbff.gz 
Yersinia_enterocolitica_NW57/GCF_025758475.1_ASM2575847v1/GCF_025758475.1_ASM2575847v1_genomic.gbff.gz Yersinia_enterocolitica_YE1/GCF_001708635.1_ASM170863v1/GCF_001708635.1_ASM170863v1_genomic.gbff.gz Yersinia_pestis_790/GCF_001188675.1_ASM118867v1/GCF_001188675.1_ASM118867v1_genomic.gbff.gz Yersinia_ruckeri_NVI-11065/GCF_026435655.1_ASM2643565v1/GCF_026435655.1_ASM2643565v1_genomic.gbff.gz Yersinia_pestis_14D/GCF_015159615.2_ASM1515961v2/GCF_015159615.2_ASM1515961v2_genomic.gbff.gz Yersinia_enterocolitica_NW115/GCF_025758655.1_ASM2575865v1/GCF_025758655.1_ASM2575865v1_genomic.gbff.gz Yersinia_enterocolitica_Gp259/GCF_025758265.1_ASM2575826v1/GCF_025758265.1_ASM2575826v1_genomic.gbff.gz Yersinia_enterocolitica_FORC066/GCF_025340245.1_ASM2534024v1/GCF_025340245.1_ASM2534024v1_genomic.gbff.gz Yersinia_pestis_20/GCF_024498415.1_ASM2449841v1/GCF_024498415.1_ASM2449841v1_genomic.gbff.gz Yersinia_pestis_FDAARGOS_602/GCF_003798345.1_ASM379834v1/GCF_003798345.1_ASM379834v1_genomic.gbff.gz Yersinia_aleksiciae_159/GCF_001047675.1_ASM104767v1/GCF_001047675.1_ASM104767v1_genomic.gbff.gz Yersinia_enterocolitica_Gp2/GCF_025758285.1_ASM2575828v1/GCF_025758285.1_ASM2575828v1_genomic.gbff.gz Yersinia_pseudotuberculosis_1/GCF_000834435.1_ASM83443v1/GCF_000834435.1_ASM83443v1_genomic.gbff.gz Yersinia_pestis_3770/GCF_001188775.1_ASM118877v1/GCF_001188775.1_ASM118877v1_genomic.gbff.gz Yersinia_intermedia_FDAARGOS_729/GCF_009730075.1_ASM973007v1/GCF_009730075.1_ASM973007v1_genomic.gbff.gz Yersinia_enterocolitica_NW67/GCF_025758535.1_ASM2575853v1/GCF_025758535.1_ASM2575853v1_genomic.gbff.gz Yersinia_intermedia_SCPM-O-B-10209_333/GCF_026183345.1_ASM2618334v1/GCF_026183345.1_ASM2618334v1_genomic.gbff.gz Yersinia_ruckeri_17Y0414/GCF_021399075.1_ASM2139907v1/GCF_021399075.1_ASM2139907v1_genomic.gbff.gz Yersinia_pestis_SCPM-O-B-6530/GCF_009295985.1_ASM929598v1/GCF_009295985.1_ASM929598v1_genomic.gbff.gz 
Yersinia_pseudotuberculosis_EP2+/GCF_000834415.1_ASM83441v1/GCF_000834415.1_ASM83441v1_genomic.gbff.gz Yersinia_pestis_KM_567/GCF_015337445.1_ASM1533744v1/GCF_015337445.1_ASM1533744v1_genomic.gbff.gz Yersinia_ruckeri_Big_Creek_74/GCF_000964565.1_ASM96456v1/GCF_000964565.1_ASM96456v1_genomic.gbff.gz Yersinia_intermedia_FDAARGOS_358/GCF_002983625.1_ASM298362v1/GCF_002983625.1_ASM298362v1_genomic.gbff.gz Yersinia_ruckeri_NVI-9681/GCF_023212445.2_ASM2321244v2/GCF_023212445.2_ASM2321244v2_genomic.gbff.gz Yersinia_kristensenii_2012N-4030/GCF_013282785.1_ASM1328278v1/GCF_013282785.1_ASM1328278v1_genomic.gbff.gz Yersinia_ruckeri_17Y0157/GCF_021399195.1_ASM2139919v1/GCF_021399195.1_ASM2139919v1_genomic.gbff.gz Yersinia_ruckeri_NVI-8270/GCF_026435135.1_ASM2643513v1/GCF_026435135.1_ASM2643513v1_genomic.gbff.gz Yersinia_ruckeri_17Y0189/GCF_021399095.1_ASM2139909v1/GCF_021399095.1_ASM2139909v1_genomic.gbff.gz Yersinia_ruckeri_NVI-8524/GCF_026435115.1_ASM2643511v1/GCF_026435115.1_ASM2643511v1_genomic.gbff.gz Yersinia_pestis_M-1482/GCF_015337645.1_ASM1533764v1/GCF_015337645.1_ASM1533764v1_genomic.gbff.gz Yersinia_pestis_Harbin_35_bis/GCF_000834275.1_ASM83427v1/GCF_000834275.1_ASM83427v1_genomic.gbff.gz Yersinia_pseudotuberculosis_NCTC10217/GCF_900635755.1_33467_B01/GCF_900635755.1_33467_B01_genomic.gbff.gz Yersinia_pseudotuberculosis_598/GCF_020889805.1_ASM2088980v1/GCF_020889805.1_ASM2088980v1_genomic.gbff.gz Yersinia_ruckeri_NVI-11267/GCF_026435335.1_ASM2643533v1/GCF_026435335.1_ASM2643533v1_genomic.gbff.gz Yersinia_enterocolitica_NW56/GCF_025758635.1_ASM2575863v1/GCF_025758635.1_ASM2575863v1_genomic.gbff.gz Yersinia_pestis_Angola/GCF_000018805.1_ASM1880v1/GCF_000018805.1_ASM1880v1_genomic.gbff.gz Yersinia_pestis_SCPM-O-DNA-18_I-3113/GCF_009295945.1_ASM929594v1/GCF_009295945.1_ASM929594v1_genomic.gbff.gz Yersinia_enterocolitica_Y11/GCF_000253175.1_ASM25317v1/GCF_000253175.1_ASM25317v1_genomic.gbff.gz 
Yersinia_pestis_Dodson/GCF_000834775.1_ASM83477v1/GCF_000834775.1_ASM83477v1_genomic.gbff.gz Yersinia_pestis_Cadman/GCF_001693595.1_ASM169359v1/GCF_001693595.1_ASM169359v1_genomic.gbff.gz Yersinia_pestis_KIM5/GCF_000970105.1_ASM97010v1/GCF_000970105.1_ASM97010v1_genomic.gbff.gz Yersinia_ruckeri_NVI-10705/GCF_023212585.2_ASM2321258v2/GCF_023212585.2_ASM2321258v2_genomic.gbff.gz Yersinia_pestis_EV76-CN/GCF_024758685.1_ASM2475868v1/GCF_024758685.1_ASM2475868v1_genomic.gbff.gz Yersinia_intermedia_FDAARGOS_730/GCF_009730055.1_ASM973005v1/GCF_009730055.1_ASM973005v1_genomic.gbff.gz Yersinia_ruckeri_NVI-11073/GCF_026435495.1_ASM2643549v1/GCF_026435495.1_ASM2643549v1_genomic.gbff.gz Yersinia_ruckeri_17Y0161/GCF_021399155.1_ASM2139915v1/GCF_021399155.1_ASM2139915v1_genomic.gbff.gz Yersinia_sp_KBS0713/GCF_005937895.2_ASM593789v2/GCF_005937895.2_ASM593789v2_genomic.gbff.gz Yersinia_pestis_SCPM-O-B-6899_231/GCF_009295925.1_ASM929592v1/GCF_009295925.1_ASM929592v1_genomic.gbff.gz Yersinia_ruckeri_NVI-5089/GCF_026435195.1_ASM2643519v1/GCF_026435195.1_ASM2643519v1_genomic.gbff.gz Yersinia_pestis_Nicholisk_41/GCF_000834885.1_ASM83488v1/GCF_000834885.1_ASM83488v1_genomic.gbff.gz Yersinia_enterocolitica_YE7/GCF_001708555.1_ASM170855v1/GCF_001708555.1_ASM170855v1_genomic.gbff.gz Yersinia_intermedia_SCPM-O-B-9106_C-191/GCF_026183365.1_ASM2618336v1/GCF_026183365.1_ASM2618336v1_genomic.gbff.gz Yersinia_canariae_NCTC_14382/GCF_009831415.1_ASM983141v1/GCF_009831415.1_ASM983141v1_genomic.gbff.gz Yersinia_enterocolitica_YE3/GCF_001708655.1_ASM170865v1/GCF_001708655.1_ASM170865v1_genomic.gbff.gz Yersinia_pseudotuberculosis_NCTC10275/GCF_900637475.1_51108_B01/GCF_900637475.1_51108_B01_genomic.gbff.gz Yersinia_enterocolitica_8081/GCF_000009345.1_ASM934v1/GCF_000009345.1_ASM934v1_genomic.gbff.gz Yersinia_ruckeri_NVI-10571/GCF_026435835.1_ASM2643583v1/GCF_026435835.1_ASM2643583v1_genomic.gbff.gz Yersinia_enterocolitica_2516-87/GCF_000834735.1_ASM83473v1/GCF_000834735.1_ASM83473v1_genomic.gbff.gz 
Yersinia_frederiksenii_FDAARGOS_417/GCF_002591095.1_ASM259109v1/GCF_002591095.1_ASM259109v1_genomic.gbff.gz Yersinia_pestis_I-1252/GCF_015336465.1_ASM1533646v1/GCF_015336465.1_ASM1533646v1_genomic.gbff.gz Yersinia_ruckeri_17Y0155/GCF_021399235.1_ASM2139923v1/GCF_021399235.1_ASM2139923v1_genomic.gbff.gz Yersinia_pseudotuberculosis_FDAARGOS_665/GCF_008693365.1_ASM869336v1/GCF_008693365.1_ASM869336v1_genomic.gbff.gz Yersinia_alsatica_SCPM-O-B-7604/GCF_025133195.1_ASM2513319v1/GCF_025133195.1_ASM2513319v1_genomic.gbff.gz Yersinia_pseudotuberculosis_PA3606/GCF_000834945.1_ASM83494v1/GCF_000834945.1_ASM83494v1_genomic.gbff.gz Yersinia_pestis_KIM10+/GCF_000006645.1_ASM664v1/GCF_000006645.1_ASM664v1_genomic.gbff.gz Yersinia_ruckeri_NVI-701/GCF_026435155.1_ASM2643515v1/GCF_026435155.1_ASM2643515v1_genomic.gbff.gz Yersinia_enterocolitica_NW117/GCF_025758455.1_ASM2575845v1/GCF_025758455.1_ASM2575845v1_genomic.gbff.gz Yersinia_enterocolitica_FORC065/GCA_025340225.1_ASM2534022v1/GCA_025340225.1_ASM2534022v1_genomic.gbff.gz Yersinia_enterocolitica_NW1/GCF_025758495.1_ASM2575849v1/GCF_025758495.1_ASM2575849v1_genomic.gbff.gz Yersinia_ruckeri_QMA0440/GCF_002192595.1_ASM219259v1/GCF_002192595.1_ASM219259v1_genomic.gbff.gz Yersinia_pseudotuberculosis_FDAARGOS_579/GCF_003798305.1_ASM379830v1/GCF_003798305.1_ASM379830v1_genomic.gbff.gz Yersinia_enterocolitica_1055Rr/GCF_000192105.1_ASM19210v1/GCF_000192105.1_ASM19210v1_genomic.gbff.gz Yersinia_hibernica_CFS1934/GCF_004124235.1_ASM412423v1/GCF_004124235.1_ASM412423v1_genomic.gbff.gz Yersinia_pestis_D106004/GCF_000022805.1_ASM2280v1/GCF_000022805.1_ASM2280v1_genomic.gbff.gz Yersinia_enterocolitica_Y1/GCF_004368055.1_ASM436805v1/GCF_004368055.1_ASM436805v1_genomic.gbff.gz Yersinia_pseudotuberculosis_IP31758/GCF_000016945.1_ASM1694v1/GCF_000016945.1_ASM1694v1_genomic.gbff.gz Yersinia_pestis_Pestoides_F_bis/GCF_000834315.1_ASM83431v1/GCF_000834315.1_ASM83431v1_genomic.gbff.gz 
Yersinia_pestis_M-1974/GCF_015336865.1_ASM1533686v1/GCF_015336865.1_ASM1533686v1_genomic.gbff.gz Yersinia_ruckeri_NHV_3758/GCF_002442495.2_ASM244249v2/GCF_002442495.2_ASM244249v2_genomic.gbff.gz Yersinia_ruckeri_17Y0163/GCF_021399115.1_ASM2139911v1/GCF_021399115.1_ASM2139911v1_genomic.gbff.gz Yersinia_pseudotuberculosis_MD67/GCF_000834355.1_ASM83435v1/GCF_000834355.1_ASM83435v1_genomic.gbff.gz Yersinia_pestis_D182038/GCF_000022825.1_ASM2282v1/GCF_000022825.1_ASM2282v1_genomic.gbff.gz Yersinia_enterocolitica_FDAARGOS_1090/GCF_016727905.1_ASM1672790v1/GCF_016727905.1_ASM1672790v1_genomic.gbff.gz Yersinia_bercovieri_ATCC_43970/GCF_013282745.1_ASM1328274v1/GCF_013282745.1_ASM1328274v1_genomic.gbff.gz Yersinia_enterocolitica_WA/GCF_000834195.1_ASM83419v1/GCF_000834195.1_ASM83419v1_genomic.gbff.gz Yersinia_ruckeri_NVI-10587/GCF_023212425.2_ASM2321242v2/GCF_023212425.2_ASM2321242v2_genomic.gbff.gz Yersinia_pestis_R/GCF_024498375.1_ASM2449837v1/GCF_024498375.1_ASM2449837v1_genomic.gbff.gz Yersinia_intermedia_N6293/GCF_022637335.1_ASM2263733v1/GCF_022637335.1_ASM2263733v1_genomic.gbff.gz Yersinia_ruckeri_NVI-6614/GCF_026435175.1_ASM2643517v1/GCF_026435175.1_ASM2643517v1_genomic.gbff.gz Yersinia_hibernica_LC20/GCF_000597945.1_ASM59794v2/GCF_000597945.1_ASM59794v2_genomic.gbff.gz Yersinia_ruckeri_17Y0153/GCF_021399175.1_ASM2139917v1/GCF_021399175.1_ASM2139917v1_genomic.gbff.gz Yersinia_aldovae_670-83/GCF_000834395.1_ASM83439v1/GCF_000834395.1_ASM83439v1_genomic.gbff.gz Yersinia_pestis_SCPM-O-B-5935_I-1996/GCF_009295965.1_ASM929596v1/GCF_009295965.1_ASM929596v1_genomic.gbff.gz Yersinia_ruckeri_YRB/GCF_000834255.1_ASM83425v1/GCF_000834255.1_ASM83425v1_genomic.gbff.gz Yersinia_enterocolitica_FORC_002_bis/GCF_001304755.1_ASM130475v1/GCF_001304755.1_ASM130475v1_genomic.gbff.gz Yersinia_pestis_Antiqua/GCF_000013825.1_ASM1382v1/GCF_000013825.1_ASM1382v1_genomic.gbff.gz Yersinia_pestis_Pestoides_B/GCF_000834925.1_ASM83492v1/GCF_000834925.1_ASM83492v1_genomic.gbff.gz 
Yersinia_pestis_M2085/GCF_015338045.2_ASM1533804v2/GCF_015338045.2_ASM1533804v2_genomic.gbff.gz Yersinia_pestis_CO92/GCF_000009065.1_ASM906v1/GCF_000009065.1_ASM906v1_genomic.gbff.gz Yersinia_ruckeri_17Y0159/GCF_021399135.1_ASM2139913v1/GCF_021399135.1_ASM2139913v1_genomic.gbff.gz Yersinia_enterocolitica_NCTC12982/GCF_901472495.1_32868_C01/GCF_901472495.1_32868_C01_genomic.gbff.gz Yersinia_pestis_SCPM-O-B-5942_I-2638/GCF_009363195.1_ASM936319v1/GCF_009363195.1_ASM936319v1_genomic.gbff.gz Yersinia_pestis_Nepal516/GCF_000013805.1_ASM1380v1/GCF_000013805.1_ASM1380v1_genomic.gbff.gz Yersinia_pseudotuberculosis_FDAARGOS_342/GCF_003546905.1_ASM354690v1/GCF_003546905.1_ASM354690v1_genomic.gbff.gz Yersinia_ruckeri_SC09/GCF_000775355.2_ASM77535v2/GCF_000775355.2_ASM77535v2_genomic.gbff.gz Yersinia_mollaretii_ATCC_43969/GCF_013282725.1_ASM1328272v1/GCF_013282725.1_ASM1328272v1_genomic.gbff.gz Yersinia_pestis_Pestoides_F/GCF_000016445.1_ASM1644v1/GCF_000016445.1_ASM1644v1_genomic.gbff.gz Yersinia_pestis_Angola_bis/GCF_000834845.1_ASM83484v1/GCF_000834845.1_ASM83484v1_genomic.gbff.gz Yersinia_ruckeri_17Y0412/GCF_021399055.1_ASM2139905v1/GCF_021399055.1_ASM2139905v1_genomic.gbff.gz Yersinia_pestis_1522/GCF_001188715.1_ASM118871v1/GCF_001188715.1_ASM118871v1_genomic.gbff.gz Yersinia_enterocolitica_MGYG-HGUT-02335/GCF_902385945.1_UHGG_MGYG-HGUT-02335/GCF_902385945.1_UHGG_MGYG-HGUT-02335_genomic.gbff.gz Yersinia_pestis_C-792/GCF_015337085.2_ASM1533708v2/GCF_015337085.2_ASM1533708v2_genomic.gbff.gz Yersinia_ruckeri_NVI-11050/GCF_023212385.2_ASM2321238v2/GCF_023212385.2_ASM2321238v2_genomic.gbff.gz Yersinia_intermedia_NCTC11469/GCF_900635455.1_28307_A01/GCF_900635455.1_28307_A01_genomic.gbff.gz Yersinia_pseudotuberculosis_FDAARGOS_583/GCF_003798285.1_ASM379828v1/GCF_003798285.1_ASM379828v1_genomic.gbff.gz Yersinia_pestis_M2029/GCF_015336265.1_ASM1533626v1/GCF_015336265.1_ASM1533626v1_genomic.gbff.gz 
Yersinia_enterocolitica_Gp200/GCF_025758555.1_ASM2575855v1/GCF_025758555.1_ASM2575855v1_genomic.gbff.gz Yersinia_massiliensis_GTA/GCF_003048255.1_ASM304825v1/GCF_003048255.1_ASM304825v1_genomic.gbff.gz Yersinia_pestis_A1122_bis/GCF_000834755.1_ASM83475v1/GCF_000834755.1_ASM83475v1_genomic.gbff.gz Yersinia_pseudotuberculosis_NZYP4713/GCF_900092345.1_YP4713/GCF_900092345.1_YP4713_genomic.gbff.gz Yersinia_pestis_PBM19/GCF_000834235.1_ASM83423v1/GCF_000834235.1_ASM83423v1_genomic.gbff.gz Yersinia_enterocolitica_NW116/GCF_025758575.1_ASM2575857v1/GCF_025758575.1_ASM2575857v1_genomic.gbff.gz Yersinia_ruckeri_KMM821/GCF_017498685.1_ASM1749868v1/GCF_017498685.1_ASM1749868v1_genomic.gbff.gz Yersinia_ruckeri_NVI-4840/GCF_026435215.1_ASM2643521v1/GCF_026435215.1_ASM2643521v1_genomic.gbff.gz Yersinia_enterocolitica_FDAARGOS_1082/GCF_016727765.1_ASM1672776v1/GCF_016727765.1_ASM1672776v1_genomic.gbff.gz Yersinia_enterocolitica_NW51/GCF_025758615.1_ASM2575861v1/GCF_025758615.1_ASM2575861v1_genomic.gbff.gz Yersinia_ruckeri_NVI-11076/GCF_023212325.2_ASM2321232v2/GCF_023212325.2_ASM2321232v2_genomic.gbff.gz Yersinia_rohdei_YRA/GCF_000834455.1_ASM83445v1/GCF_000834455.1_ASM83445v1_genomic.gbff.gz Yersinia_pestis_C-781/GCF_015336085.1_ASM1533608v1/GCF_015336085.1_ASM1533608v1_genomic.gbff.gz Yersinia_pestis_Harbin_35/GCF_000186725.1_ASM18672v1/GCF_000186725.1_ASM18672v1_genomic.gbff.gz Yersinia_pseudotuberculosis_ATCC_6904/GCF_000750315.1_ASM75031v1/GCF_000750315.1_ASM75031v1_genomic.gbff.gz Yersinia_pseudotuberculosis_FDAARGOS_580/GCF_003798445.1_ASM379844v1/GCF_003798445.1_ASM379844v1_genomic.gbff.gz Yersinia_enterocolitica_str_YE5303/GCF_000968115.1_ASM96811v1/GCF_000968115.1_ASM96811v1_genomic.gbff.gz Yersinia_pestis_FDAARGOS_601/GCF_003798225.1_ASM379822v1/GCF_003798225.1_ASM379822v1_genomic.gbff.gz Yersinia_pestis_SCPM-O-B-6291_C-25/GCF_009296005.1_ASM929600v1/GCF_009296005.1_ASM929600v1_genomic.gbff.gz 
Yersinia_pestis_Nairobi/GCF_000835005.1_ASM83500v1/GCF_000835005.1_ASM83500v1_genomic.gbff.gz Yersinia_pseudotuberculosis_FDAARGOS_584/GCF_003798385.1_ASM379838v1/GCF_003798385.1_ASM379838v1_genomic.gbff.gz Yersinia_similis_228/GCF_000582515.1_ASM58251v1/GCF_000582515.1_ASM58251v1_genomic.gbff.gz Yersinia_pestis_1413/GCF_001188935.1_ASM118893v1/GCF_001188935.1_ASM118893v1_genomic.gbff.gz Yersinia_pseudotuberculosis_FDAARGOS_581/GCF_003798425.1_ASM379842v1/GCF_003798425.1_ASM379842v1_genomic.gbff.gz Yersinia_entomophaga_MH96/GCF_001656035.1_ASM165603v1/GCF_001656035.1_ASM165603v1_genomic.gbff.gz Yersinia_ruckeri_NVI-1176/GCF_026435295.1_ASM2643529v1/GCF_026435295.1_ASM2643529v1_genomic.gbff.gz Yersinia_pestis_S19960127/GCF_015190655.1_ASM1519065v1/GCF_015190655.1_ASM1519065v1_genomic.gbff.gz Yersinia_ruckeri_NVI-4479/GCF_026435255.1_ASM2643525v1/GCF_026435255.1_ASM2643525v1_genomic.gbff.gz Yersinia_frederiksenii_Y225/GCF_000834215.1_ASM83421v1/GCF_000834215.1_ASM83421v1_genomic.gbff.gz Yersinia_ruckeri_NVI-4570/GCF_026435235.1_ASM2643523v1/GCF_026435235.1_ASM2643523v1_genomic.gbff.gz Yersinia_pseudotuberculosis_IP2666pIB1/GCF_003814345.1_ASM381434v1/GCF_003814345.1_ASM381434v1_genomic.gbff.gz Yersinia_pseudotuberculosis_FDAARGOS_582/GCF_003798405.1_ASM379840v1/GCF_003798405.1_ASM379840v1_genomic.gbff.gz Yersinia_enterocolitica_NCTC13769/GCF_900637005.1_46582_C01/GCF_900637005.1_46582_C01_genomic.gbff.gz Yersinia_pestis_A1122/GCF_000222975.1_ASM22297v1/GCF_000222975.1_ASM22297v1_genomic.gbff.gz Yersinia_enterocolitica_YE165/GCF_001708575.1_ASM170857v1/GCF_001708575.1_ASM170857v1_genomic.gbff.gz Yersinia_pseudotuberculosis_IP32953/GCF_000047365.1_ASM4736v1/GCF_000047365.1_ASM4736v1_genomic.gbff.gz Yersinia_pestis_8787/GCF_001188755.1_ASM118875v1/GCF_001188755.1_ASM118875v1_genomic.gbff.gz Yersinia_rochesterensis_ATCC_33639/GCF_000750355.1_ASM75035v1/GCF_000750355.1_ASM75035v1_genomic.gbff.gz 
Yersinia_pestis_FDAARGOS_603/GCF_003798205.1_ASM379820v1/GCF_003798205.1_ASM379820v1_genomic.gbff.gz Yersinia_pseudotuberculosis_PB1+_bis/GCF_000834475.1_ASM83447v1/GCF_000834475.1_ASM83447v1_genomic.gbff.gz Yersinia_ruckeri_NVI-11294/GCF_026435315.1_ASM2643531v1/GCF_026435315.1_ASM2643531v1_genomic.gbff.gz Yersinia_enterocolitica_NW66/GCF_025758595.1_ASM2575859v1/GCF_025758595.1_ASM2575859v1_genomic.gbff.gz Yersinia_pestis_1045/GCF_001188735.1_ASM118873v1/GCF_001188735.1_ASM118873v1_genomic.gbff.gz; do
        # The locus tag is the gene ID without its trailing numeric field
        output=$(python3 extract_CDS_of_a_locus_tag.py "${gbff}" "$(echo "${gene_id}" | cut -d '_' -f 1-2)")
        if [[ -n "${output}" ]]; then
            gbff_short=$(echo "${gbff}" | cut -d '/' -f 1)
            printf "%s\t%s\n" "${gbff_short}" "${output}" >> yopM_seq.txt
        fi
      done
    done
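
The loop above derives the locus tag by cutting the gene ID down to its first two underscore-separated fields. The following is a minimal sketch of the same step with shell parameter expansion instead of `echo | cut`. The function name `locus_tag_of` is hypothetical, and the sketch assumes every gene ID has exactly three fields, as in the list above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original notes): drop the last
# underscore-separated field of a gene ID,
# e.g. M486_RS20920_3990 -> M486_RS20920.
locus_tag_of() {
    printf '%s\n' "${1%_*}"
}

# Possible use inside the loop (assumes the same script and arguments as above):
#   output=$(python3 extract_CDS_of_a_locus_tag.py "${gbff}" "$(locus_tag_of "${gene_id}")")
```

For IDs with exactly three fields this gives the same result as `cut -d '_' -f 1-2` and avoids starting two extra processes per iteration.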
  5. Extract the sequences according to the NCBI annotations

    #------------------------------- yopJ (+6) -------------------------------
    #grep "yopJ" selected_gtf_files/Yersinia_enterocolitica_2516-87.gtf
    NZ_CP009837.1   RefSeq  gene    69041   69701   .       -       .       gene_id "CH48_RS00445"; transcript_id ""; gbkey "Gene"; gene "yopJ"; gene_biotype "protein_coding"; locus_tag "CH48_RS00445"; old_locus_tag "CH48_4238"; part "2"; 
    NZ_CP009837.1   RefSeq  gene    1       206     .       -       .       gene_id "CH48_RS00445"; transcript_id ""; gbkey "Gene"; gene "yopJ"; gene_biotype "protein_coding"; locus_tag "CH48_RS00445"; old_locus_tag "CH48_4238"; part "1"; 
    
    #grep "yopJ" selected_gtf_files/Yersinia_pestis_790.gtf (NZ_CP006807.1)
    
    #grep "yopJ" selected_gtf_files/Yersinia_pestis_Antiqua_bis.gtf
    NZ_CP009905.1   RefSeq  gene    16737   17602   .       -       .       gene_id "CH58_RS00725"; transcript_id ""; gbkey "Gene"; gene "yopJ"; gene_biotype "pseudogene"; locus_tag "CH58_RS00725"; old_locus_tag "CH58_4444"; pseudo "true"; 
    
    #grep "yopJ" selected_gtf_files/Yersinia_pestis_FDAARGOS_602.gtf
    NZ_CP033695.1   RefSeq  gene    36152   37017   .       +       .       gene_id "EGX42_RS00935"; transcript_id ""; gbkey "Gene"; gene "yopJ"; gene_biotype "pseudogene"; locus_tag "EGX42_RS00935"; old_locus_tag "EGX42_00930"; pseudo "true"; 
    
    #grep "yopJ" selected_gtf_files/Yersinia_pestis_Pestoides_B.gtf 
    NZ_CP010022.1   RefSeq  gene    23121   23986   .       -       .       gene_id "CH60_RS00825"; transcript_id ""; gbkey "Gene"; gene "yopJ"; gene_biotype "pseudogene"; locus_tag "CH60_RS00825"; old_locus_tag "CH60_4301"; pseudo "true"; 
    
    #grep "yopJ" selected_gtf_files/Yersinia_pseudotuberculosis_EP2+.gtf
    NZ_CP009758.1   RefSeq  gene    33302   34168   .       +       .       gene_id "BZ20_RS00215"; transcript_id ""; gbkey "Gene"; gene "yopJ"; gene_biotype "pseudogene"; locus_tag "BZ20_RS00215"; old_locus_tag "BZ20_4189"; pseudo "true"; 
    
    #under selected_fna_files
    samtools faidx Yersinia_enterocolitica_2516-87.fna NZ_CP009837.1:69041-69701 > temp.fna
    samtools faidx Yersinia_enterocolitica_2516-87.fna NZ_CP009837.1:1-206 >> temp.fna
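    # Reverse-complement with EMBOSS revseq (the gene is on the minus strand)
    revseq -sequence temp.fna -outseq yersinia_enterocolitica_2516-87.rev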
    sed -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' yersinia_enterocolitica_2516-87.rev > temp_.fna
    
    samtools faidx Yersinia_pestis_Antiqua_bis.fna NZ_CP009905.1:16737-17602 > temp.fna
    revseq -sequence temp.fna -outseq 16737-17602.rev
    sed -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' 16737-17602.rev > temp_.fna
    
    samtools faidx Yersinia_pestis_FDAARGOS_602.fna NZ_CP033695.1:36152-37017 > temp.fna
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    samtools faidx Yersinia_pestis_Pestoides_B.fna NZ_CP010022.1:23121-23986 > temp.fna
    revseq -sequence temp.fna -outseq 23121-23986.rev
    sed -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' 23121-23986.rev > temp_.fna
    
    samtools faidx Yersinia_pseudotuberculosis_EP2+.fna NZ_CP009758.1:33302-34168 > temp.fna
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
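
The commands above reverse-complement the minus-strand genes and then flatten each FASTA file into one line. The following sketch does both steps with standard tools only (`awk`, `tr`). The function names `revcomp` and `fasta_to_tsv` are hypothetical and not part of the original workflow. The sketch assumes plain A/C/G/T sequences without ambiguity codes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reverse-complement a bare nucleotide string
# (no FASTA header). Only A/C/G/T (upper or lower case) are complemented.
revcomp() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = length($0); i >= 1; i--) out = out substr($0, i, 1)
        print out
    }' | tr 'ACGTacgt' 'TGCAtgca'
}

# Hypothetical helper: read FASTA on stdin and print one line,
# "<label><TAB><sequence>". Header lines are dropped and all sequence
# lines (from one or several records) are concatenated in input order.
fasta_to_tsv() {
    local label="$1"
    awk -v label="$label" '
        /^>/ { next }
        { seq = seq $0 }
        END { printf "%s\t%s\n", label, seq }
    '
}

# Possible use for a plus-strand gene (file and region taken from the notes above):
#   samtools faidx Yersinia_pestis_FDAARGOS_602.fna NZ_CP033695.1:36152-37017 \
#       | fasta_to_tsv Yersinia_pestis_FDAARGOS_602
```

Because the label is passed explicitly, the output line needs no manual clean-up of the `>contig:start-end` header afterwards.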
    
    Yersinia_enterocolitica_2516-87 ATGATTGGGCCAATATCACAAATAAACAGCTTCGGTGGCTTATCAGAAAAAGAGACCCGTTCTTTAATCAGTAATGAAGAGCTTAAAAATATCATAATACAGTTGGAAACTGATATAGCGGATGGATCCTGGTTCCATAAAAATTATTCACGCCTGGATATAGAAGTCATGCCCGCATTAGTAATTCAGGCGAACAATAAATATCCGGAAATGAATCTTAATTTTGTTACATCTCCCCAGGACCTTTCGATAGAAATAAAAAATGTCATAGAAAATGGAGTTGGATCTTCCCGCTTCATAATTAACATGGGGGAGGGTGGAATACATTTCAGTGTAATTGATTACAAACATATAAATGGGAAAACATCTCTGATATTATTTGAACCAGTAAACTTTAATAGTATGGGGCCAGCGATACTGGCAATAAGTACAAAAACGGCCATTGAACGTTATCAATTACCTGATTGCCATTTTTCCATGGTGGAAATGGATATTCAGCGAAGCTCATCTGAATGTGGTATTTTTAGTTTGGCACTGGCAAAAAAACTTTACACCGAGAGAGATAGCCTGTTGAAAATACATGAAGATAATATAAAAGGTATATTAAGTGATAGTGAAAATCCTTTACCCCACAATAAGTTGGATCCGTATCTCCCGGTAACTTTTTACAAACATACTCAAGGTAAAAAACGTCTTAATGAATATTTAAATACTAACCCGCAGGGAGTTGGTACTGTTGTTAACAAAAAAAATGAAACCATCTTTAATAGGTTTGATAACAATAAATCCATTATAGATGGAAAGGAATTATCAGTTTCGGTACATAAAAAGAGAATAGCTGAATATAAAACACTTCTCAAAGTATAA
    Yersinia_pestis_Antiqua_bis ATGATCGGACCAATATCACAAATAAATATCTCCGGTGGCTTATCAGAAAAAGAGACCAGTTCTTTAATCAGTAATGAAGAGCTTAAAAATATCATAACACAGTTGGAAACTGATATATCGGATGGATCCTGGTTCCATAAAAATTATTCACGTATGGATGTAGAAGTCATGCCCGCATTGGTAATCCAGGCGAACAATAAATATCCGGAAATGAATCTTAATCTTGTTACATCTCCATTGGACCTTTCAATAGAAATAAAAAACGTCATAGAAAATGGAGTTAGATCTTCCCGCTTCATAATTAACATGGGGGAAGGTGGAATACATTTCAGTGTAATTGATTACAAACATATAAATGGGAAAACATCTCTGATATTGTTTGAACCAGCAAACTTTAACAGTATGGGGCCAGCGATGCTGGCAATAAGGACAAAAACGGCTATTGAACGTTATCAATTACCTGATTGCCATTTCTCCATGGTGGAAATGGATATTCAGCGAAGCTCATCTGAATGTGGTATTTTTAGTTTTGCACTGGCAAAAAAACTTTACATCGAGAGAGATAGCCTGTTGAAAATACATGAAGATAATATAAAAGGTATATTAAGTGATGGTAAAAATCCTTTACCCCACGATAAGTTGGACCCGTATCTCCCGGTAACTTTTTACAAACATACTCAAGGTAAAAAACGTCTTAATGAATATTTAAATACTAACCCGCAGGGAGTTGGTACTGTTGTTAACAAAAAAATGAAACCATCGTTAATAGATTTGATAACAATAAATCCATTGTAGATGGAAAGGAATTATCAGTTTCGGTACATAAAAAGAGAATAGCTGAATATAAAACACTTCTCAAAGTATAA
    >Yersinia_pestis_FDAARGOS_602   ATGATCGGACCAATATCACAAATAAATATCTCCGGTGGCTTATCAGAAAAAGAGACCAGTTCTTTAATCAGTAATGAAGAGCTTAAAAATATCATAACACAGTTGGAAACTGATATATCGGATGGATCCTGGTTCCATAAAAATTATTCACGTATGGATGTAGAAGTCATGCCCGCATTGGTAATCCAGGCGAACAATAAATATCCGGAAATGAATCTTAATCTTGTTACATCTCCATTGGACCTTTCAATAGAAATAAAAAACGTCATAGAAAATGGAGTTAGATCTTCCCGCTTCATAATTAACATGGGGGAAGGTGGAATACATTTCAGTGTAATTGATTACAAACATATAAATGGGAAAACATCTCTGATATTGTTTGAACCAGCAAACTTTAACAGTATGGGGCCAGCGATGCTGGCAATAAGGACAAAAACGGCTATTGAACGTTATCAATTACCTGATTGCCATTTCTCCATGGTGGAAATGGATATTCAGCGAAGCTCATCTGAATGTGGTATTTTTAGTTTTGCACTGGCAAAAAAACTTTACATCGAGAGAGATAGCCTGTTGAAAATACATGAAGATAATATAAAAGGTATATTAAGTGATGGTGAAAATCCTTTACCCCACGATAAGTTGGACCCGTATCTCCCGGTAACTTTTTACAAACATACTCAAGGTAAAAAACGTCTTAATGAATATTTAAATACTAACCCGCAGGGAGTTGGTACTGTTGTTAACAAAAAAAATGAAACCATCGTTAATAGATTTGATAACAATAAATCCATTGTAGATGGAAAGGAATTATCAGTTTCGTACATAAAAAGAGAATAGCTGAATATAAAACACTTCTCAAAGTATAA
    >Yersinia_pestis_Pestoides_B    ATGATCGGACCAATATCACAAATAAATATCTCCGGTGGCTTATCAGAAAAAGAGACCAGTTCTTTAATCAGTAATGAAGAGCTTAAAAATATCATAACACAGTTGGAAACTGATATATCGGATGGATCCTGGTTCCATAAAAATTATTCACGTATGGATGTAGAAGTCATGCCCGCATTGGTAATCCAGGCGAACAATAAATATCCGGAAATGAATCTTAATCTTGTTACATCTCCATTGGACCTTTCAATAGAAATAAAAAACGTCATAGAAAATGGAGTTAGATCTTCCCGCTTCATAATTAACATGGGGGAAGGTGGAATACATTTCAGTGTAATTGATTACAAACATATAAATGGGAAAACATCTCTGATATTGTTTGAACCAGCAAACTTTAACAGTATGGGGCCAGCGATGCTGGCAATAAGGACAAAAACGGCTATTGAACGTTATCAATTACCTGATTGCCATTTCTCCATGGTGGAAATGGATATTCAGCGAAGCTCATCTGAATGTGGTATTTTTAGTTTTGCACTGGCAAAAAAACTTTACATCGAGAGAGATAGCCTGTTGAAAATACATGAAGATAATATAAAAGGTATATTAAGTGATGGTGAAAATCCTTTACCCCACGATAAGTTGGACCCGTATCTCCCGGTAACTTTTTACAAACATACTCAAGGTAAAAAACGTCTTAATGAATATTTAAATACTAACCCGCAGGGAGTTGGTACTGTTGTTAACAAAAAAAATGAAACCATCGTTAATAGATTTGATAACAATAAATCCATTGTAGATGGAAAGGAATTATCAGTTTCGTACATAAAAAGAGAATAGCTGAATATAAAACACTTCTCAAAGTATAA
    Yersinia_pseudotuberculosis_EP2+    ATGATCGGACCAATATCACAAATAAATATCTCCGGTGGCTTATCAGAAAAAGAGACCAGTTCTTTAATCAGTAATGAAGAGCTTAAAAATATCATAACACAGTTGGAAACTGATATATCGGATGGATCCTGGTTCCATAAAAATTATTCACGTATGGATGTAGAAGTCATGCCCGCATTGGTAATCTAGGCGAACAATAAATATCCGGAAATGAATCTTAATCTTGTTACATCTCCATTGGACCTTTCAATAGAAATAAAAAACGTCATAGAAAATGGAGTTAGATCTTCCCGCTTCATAATTAACATGGGGGAAGGTGGAATACATTTCAGTGTAATTGATTACAAACATATAAATGGGAAAACATCTCTGATATTGTTTGAACCAGCAAACTTTAACAGTATGGGGCCAGCGATGCTGGCAATAAGGACAAAAACGGCTATTGAACGTTATCAATTACCTGATTGCCATTTCTCCATGGTGGAAATGGATATTCAGCGAAGCTCATCTGAATGTGGTATTTTTAGTTTTGCACTGGCAAAAAAACTTTACATCGAGAGAGATAGCCTGTTGAAAATACATGAAGATAATATAAAAGGTATATTAAGTGATGGTGAAAATCCTTTACCCCACGATAAGTTGGACCCGTATCTCCCGGTAACTTTTTACAAACATACTCAAGGTAAAAAACGTCTTAATGAATATTTAAATACTAACCCGCAGGGAGTTGGTACTGTTGTTAACAAAAAAAATGAAACCATCGTTAATAGATTTGATAACAATAAATCCATTGTAGATGGAAAGGAATTATCAGTTTCGGTACATAAAAAGAGAATAGCTGAATATAAAACACTTCTCAAAGTATAA
    
    #------------------------------- yopB (+4) -------------------------------
    
    #grep "yopB" Yersinia_enterocolitica_YE1.gtf
    NZ_CP016946.1   RefSeq  gene    73029   73029   .       +       .       gene_id "BFS78_RS21560"; transcript_id ""; gbkey "Gene"; gene "yopB"; gene_biotype "protein_coding"; locus_tag "BFS78_RS21560"; old_locus_tag "BFS78_21560"; part "1"; 
    NZ_CP016946.1   RefSeq  gene    1       1205    .       +       .       gene_id "BFS78_RS21560"; transcript_id ""; gbkey "Gene"; gene "yopB"; gene_biotype "protein_coding"; locus_tag "BFS78_RS21560"; old_locus_tag "BFS78_21560"; part "2";
    
    #grep "yopB" Yersinia_enterocolitica_YE3.gtf
    NZ_CP016943.1   RefSeq  gene    72880   73026   .       +       .       gene_id "BED35_RS00480"; transcript_id ""; gbkey "Gene"; gene "yopB"; gene_biotype "pseudogene"; locus_tag "BED35_RS00480"; old_locus_tag "BED35_00480"; part "1"; pseudo "true"; 
    NZ_CP016943.1   RefSeq  gene    1       1058    .       +       .       gene_id "BED35_RS00480"; transcript_id ""; gbkey "Gene"; gene "yopB"; gene_biotype "pseudogene"; locus_tag "BED35_RS00480"; old_locus_tag "BED35_00480"; part "2"; pseudo "true";
    
    #grep "yopB" Yersinia_enterocolitica_YE5.gtf
    NZ_CP016939.1   RefSeq  gene    73034   73034   .       +       .       gene_id "BED32_RS00010"; transcript_id ""; gbkey "Gene"; gene "yopB"; gene_biotype "protein_coding"; locus_tag "BED32_RS00010"; old_locus_tag "BED32_00010"; part "1"; 
    NZ_CP016939.1   RefSeq  gene    1       1205    .       +       .       gene_id "BED32_RS00010"; transcript_id ""; gbkey "Gene"; gene "yopB"; gene_biotype "protein_coding"; locus_tag "BED32_RS00010"; old_locus_tag "BED32_00010"; part "2"; 
    
    #grep "yopB" Yersinia_pestis_Harbin_35_bis.gtf
    NZ_CP009703.1   RefSeq  gene    18869   20075   .       +       .       gene_id "CH55_RS00745"; transcript_id ""; gbkey "Gene"; gene "yopB"; gene_biotype "pseudogene"; locus_tag "CH55_RS00745"; old_locus_tag "CH55_4304"; pseudo "true"; 
    
    #under selected_fna_files
    # Each block overwrites temp.fna, so copy the sequence out before running the next one.
    # YE1: annotated in two parts (73029 and 1-1205)
    samtools faidx Yersinia_enterocolitica_YE1.fna NZ_CP016946.1:73029-73029 > temp.fna
    samtools faidx Yersinia_enterocolitica_YE1.fna NZ_CP016946.1:1-1205 >> temp.fna
    # YE3: annotated in two parts (72880-73026 and 1-1058)
    samtools faidx Yersinia_enterocolitica_YE3.fna NZ_CP016943.1:72880-73026 > temp.fna
    samtools faidx Yersinia_enterocolitica_YE3.fna NZ_CP016943.1:1-1058 >> temp.fna
    # YE5: annotated in two parts (73034 and 1-1205)
    samtools faidx Yersinia_enterocolitica_YE5.fna NZ_CP016939.1:73034-73034 > temp.fna
    samtools faidx Yersinia_enterocolitica_YE5.fna NZ_CP016939.1:1-1205 >> temp.fna
    # Harbin_35_bis: single region
    samtools faidx Yersinia_pestis_Harbin_35_bis.fna NZ_CP009703.1:18869-20075 > temp.fna
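
The region strings passed to `samtools faidx` above are copied by hand from the `grep` output. The following sketch prints them directly from a GTF file. The function name `gtf_regions` is hypothetical. The sketch assumes the GTF layout shown above: contig in column 1, feature type in column 3, start and end in columns 4 and 5, strand in column 7, and a `gene "<name>";` attribute.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for every "gene" feature whose attributes contain
# gene "<name>";, print "<contig>:<start>-<end><TAB><strand>" in file order.
gtf_regions() {
    local gene="$1" gtf="$2"
    awk -v pat="gene \"${gene}\";" '
        $3 == "gene" && index($0, pat) {
            printf "%s:%s-%s\t%s\n", $1, $4, $5, $7
        }
    ' "$gtf"
}

# Possible use (file name taken from the notes above):
#   gtf_regions yopB Yersinia_enterocolitica_YE1.gtf
```

For genes annotated in several parts the function prints one line per part in the order they appear in the file, so the concatenation order still has to be checked against the `part` attribute and the strand.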
    
    Yersinia_enterocolitica_YE1 ATGAGTGCGTTGATAACCCATGATCGCTCAACGCCAGTAACTGGAAGTCTAGTTCCCTACATCGAGACACCAGCGCCCGCCCCCCTTCAGACCCAACAAGTCGCGGGAGAACTGAAGGATAAAAATGGCGGGGTGAGTTCTCAGGGCGTGCAGCTCCCTGCACCACTAGCAGTGGTTGCCAGCCAAGTCACTGAAGGACAACAGCAAGAAATCACTAAATTATTGGAGTCGGTCACCCGCGGCACGGCAGGATCTCAACTGATATCAAATTATGTTTCAGTGCTAACGAATTTTACGCTCGCTTCACCTGATACATTTGAGATTGAGTTAGGTAAGCTAGTTTCTAATTTAGAAGAAGTACGCAAAGACATAAAAATCGCTGATATTCAGCGTCTTCATGAACAAAACATGAAGAAAATTGAAGAGAATCAAGAGAAAATCAAAGAAACAGAAGAGAATGCCAAGCAAGTCAAGAAATCCGGCATGGCATCAAAGATTTTTGGCTGGCTCAGCGCCATAGCCTCAGTGGTTATCGGTGCCATCATGGTGGCCTCAGGGGTAGGAGCCGTTGCCGGTGCAATGATGATTGCCTCAGGCGTAATTGGGATGGCGAATATGGCTGTGAAACAAGCGGCGGAAGATGGCCTGATATCCCAAGAGGCAATGCAAGTATTAGGGCCGATACTCACTGCGATTGAAGTCGCATTGACTGTAGTTTCAACCGTAATGACCTTTGGCGGTTCGGCACTAAAATGCCTGGCTGATATTGGCGCAAAACTCGGTGCTAACACCGCAAGTCTTGCTGCTAAAGGAGCCGAGTTTTCAGCCAAAGTTGCCCAAATTTCGACAGGCATATCAAACACTGTCGGGAGTGCAGTGACTAAATTAGGGGGCAGTTTTGGTAGTTTAACAATGAGCCATGTAATCCGTACAGGATCACAGGCAACACAAGTCGCCGTTGGTGTGGGCAGCGGAATAACTCAGACCATCAATAATAAAAAACAAGCTGATTTACAACATAATAACGCTGATTTGGCCTTGAACAAGGCAGACATGGCAGCGTTACAAAGTATTATTGACCGACTCAAAGAAGAGTTATCCCATTTGTCAGAGTCACATCAACAAGTGATGGAACTGATTTTCCAGATGATTAATGCAAAAGGTGACATGCTGCATAATTTGGCCGGCAGACCCCATACTGTTTAA
    Yersinia_enterocolitica_YE3 ATGAGTGCGTTGATAACCCATGATCGCTCAACGCCAGTAACTGGAAGTCTAGTTCCCTACATCGAGACACCAGCGCCCGCCCCCTTCAGACCCAACAAGTCGCGGGAGAACTGAAGGATAAAAATGGCGGGGTGAGTTCTCAGGGCGTGCAGCTCCCTGCACCACTAGCAGTGGTTGCCAGCCAAGTCACTGAAGGACAACAGCAAGAAATCACTAAATTATTGGAGTCGGTCACCCGCGGCACGGCAGGATCTCAACTGATATCAAATTATGTTTCAGTGCTAACGAATTTTACGCTCGCTTCACCTGATACATTTGAGATTGAGTTAGGTAAGCTAGTTTCTAATTTAGAAGAAGTACGCAAAGACATAAAAATCGCTGATATTCAGCGTCTTCATGAACAAAACATGAAGAAAATTGAAGAGAATCAAGAGAAAATCAAAGAAACAGAAGAGAATGCCAAGCAAGTCAAGAAATCCGGCATGGCATCAAAGATTTTTGGCTGGCTCAGCGCCATAGCCTCAGTGGTTATCGGTGCCATCATGGTGGCCTCAGGGGTAGGAGCCGTTGCCGGTGCAATGATGATTGCCTCAGGCGTAATTGGGATGGCGAATATGGCTGTGAAACAAGCGGCGGAAGATGGCCTGATATCCCAAGAGGCAATGCAAGTATTAGGGCCGATACTCACTGCGATTGAAGTCGCATTGACTGTAGTTTCAACCGTAATGACCTTTGGCGGTTCGGCACTAAAATGCCTGGCTGATATTGGCGCAAAACTCGGTGCTAACACCGCAAGTCTTGCTGCTAAAGGAGCCGAGTTTTCAGCCAAAGTTGCCCAAATTTCGACAGGCATATCAAACACTGTCGGGAGTGCAGTGACTAAATTAGGGGGCAGTTTTGGTAGTTTAACAATGAGCCATGTAATCCGTACAGGATCACAGGCAACACAAGTCGCCGTTGGTGTGGGCAGCGGAATAACTCAGACCATCAATAATAAAAAACAAGCTGATTTACAACATAATAACGCTGATTTGGCCTTGAACAAGGCAGACATGGCAGCGTTACAAAGTATTATTGACCGACTCAAAGAAGAGTTATCCCATTTGTCAGAGTCACATCAACAAGTGATGGAACTGATTTTCCAGATGATTAATGCAAAAGGTGACATGCTGCATAATTTGGCCGGCAGACCCCATACTGTTTAA
    Yersinia_enterocolitica_YE5 ATGAGTGCGTTGATAACCCATGATCGCTCAACGCCAGTAACTGGAAGTCTAGTTCCCTACATCGAGACACCAGCGCCCGCCCCCCTTCAGACCCAACAAGTCGCGGGAGAACTGAAGGATAAAAATGGCGGGGTGAGTTCTCAGGGCGTGCAGCTCCCTGCACCACTAGCAGTGGTTGCCAGCCAAGTCACTGAAGGACAACAGCAAGAAATCACTAAATTATTGGAGTCGGTCACCCGCGGCACGGCAGGATCTCAACTGATATCAAATTATGTTTCAGTGCTAACGAATTTTACGCTCGCTTCACCTGATACATTTGAGATTGAGTTAGGTAAGCTAGTTTCTAATTTAGAAGAAGTACGCAAAGACATAAAAATCGCTGATATTCAGCGTCTTCATGAACAAAACATGAAGAAAATTGAAGAGAATCAAGAGAAAATCAAAGAAACAGAAGAGAATGCCAAGCAAGTCAAGAAATCCGGCATGGCATCAAAGATTTTTGGCTGGCTCAGCGCCATAGCCTCAGTGGTTATCGGTGCCATCATGGTGGCCTCAGGGGTAGGAGCCGTTGCCGGTGCAATGATGATTGCCTCAGGCGTAATTGGGATGGCGAATATGGCTGTGAAACAAGCGGCGGAAGATGGCCTGATATCCCAAGAGGCAATGCAAGTATTAGGGCCGATACTCACTGCGATTGAAGTCGCATTGACTGTAGTTTCAACCGTAATGACCTTTGGCGGTTCGGCACTAAAATGCCTGGCTGATATTGGCGCAAAACTCGGTGCTAACACCGCAAGTCTTGCTGCTAAAGGAGCCGAGTTTTCAGCCAAAGTTGCCCAAATTTCGACAGGCATATCAAACACTGTCGGGAGTGCAGTGACTAAATTAGGGGGCAGTTTTGGTAGTTTAACAATGAGCCATGTAATCCGTACAGGATCACAGGCAACACAAGTCGCCGTTGGTGTGGGCAGCGGAATAACTCAGACCATCAATAATAAAAAACAAGCTGATTTACAACATAATAACGCTGATTTGGCCTTGAACAAGGCAGACATGGCAGCGTTACAAAGTATTATTGACCGACTCAAAGAAGAGTTATCCCATTTGTCAGAGTCACATCAACAAGTGATGGAACTGATTTTCCAGATGATTAATGCAAAAGGTGACATGCTGCATAATTTGGCCGGCAGACCCCATACTGTTTAA
    Yersinia_pestis_Harbin_35_bis   ATGAGTGCGTTGATAACCCATGACCGCTCAACGCCAGTAACTGGAAGTCTACTTCCCTACGTCGAGACACCAGCGCCCGCCCCCCCTTCAGACCCAACAAGTCGCGGGAGAACTGAAGGATAAAAATGGCGGGGTGAGTTCTCAGGGCGTACAGCTCCCTGCACCACTAGCAGTGGTTGCCAGCCAAGTTACTGAAGGACAACAGCAAGAAGTCACTAAATTATTGGAGTCGGTCACCCGCGGCGCGGCAGGATCTCAACTGATATCAAATTATGTTTCAGTGCTAACGAAGTTTACGCTTGCTTCACCTGATACATTTGAGATTGAGTTAGGTAAGCTAGTTTCTAATTTAGAAGAAGTACGCAAAGACATAAAAATCGCTGATATTCAGCGTCTTCATGAACAAAACATGAAGAAAATTGAAGAGAATCAAGAGAAAATCAAAGAAACAGAAGAGAATGCCAAGCAAGTCAAGAAATCCGGCATCGCATCAAAGATTTTTGGCTGGCTCAGCGCCATAGCCTCAGTGATTGTCGGTGCCATCATGGTGGCCTCAGGGGTAGGAGCCGTTGCCGGTGCAATGATGGTTGCCTCAGGCGTAATTGGGATGGCGAATATGGCAGTGAAACAAGCGGCGGAAGATGGCCTGATATCCCAAGAGGCAATGAAAATATTAGGGCCGATACTCACTGCGATTGAAGTCGCATTGACTGTAGTTTCAACCGTAATGACCTTTGGCGGTTCGGCACTAAAATGCCTGGCTAATATTGGCGCAAAACTCGGTGCTAACACCGCAAGTCTTGTGGCTAAAGGAGCCGAGTTTTCGGCCAAAGTTGCCCAAATTTCGACAGGCATATCAAACACTGTCGGGAGTGCAGTGACTAAATTAGGGGGCAGTTTTGCTGGTTTAACAATGAGCCATGCAATCCGTACAGGATCACAGGCAACACAAGTCGCCGTTGGTGTGGGCAGCGGAATAACTCAGACCATCAATAATAAAAAGCAAGCTGATTTACAACATAATAACGCTGATTTGGCCTTGAACAAGGCAGACATGGCAGCGTTACAAAGTATTATTGACCGACTCAAAGAAGAGTTATCCCATTTGTCAGAGTCACATCAACAAGTGATGGAACTGATTTTCCAGATGATTAATGCAAAAGGTGACATGCTGCATAATTTGGCCGGCAGACCCCATACTGTTTAA
    
    #------------------------------- yopT (+9) -------------------------------
    #grep "yopT" selected_gtf_files/Yersinia_pestis_1412.gtf
    NZ_CP006780.1   RefSeq  gene    43360   44327   .       +       .       gene_id "M479_RS22185"; transcript_id ""; gbkey "Gene"; gene "yopT"; gene_biotype "pseudogene"; locus_tag "M479_RS22185"; old_locus_tag "M479_4302"; pseudo "true";
    
    #grep "yopT" selected_gtf_files/Yersinia_pestis_1413.gtf
    NZ_CP006761.1   RefSeq  gene    60310   61277   .       +       .       gene_id "M480_RS22170"; transcript_id ""; gbkey "Gene"; gene "yopT"; gene_biotype "pseudogene"; locus_tag "M480_RS22170"; old_locus_tag "M480_4319"; pseudo "true"; 
    
    #grep "yopT" selected_gtf_files/Yersinia_pestis_1522.gtf
    NZ_CP006757.1   RefSeq  gene    61673   62640   .       -       .       gene_id "M481_RS22190"; transcript_id ""; gbkey "Gene"; gene "yopT"; gene_biotype "pseudogene"; locus_tag "M481_RS22190"; old_locus_tag "M481_4325"; pseudo "true";
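
The coordinates in the `grep` output above are copied by hand into each `samtools faidx` call. The following is a minimal sketch of deriving the region string directly from the annotation, assuming the standard nine-column GTF layout. The helper name `gtf_regions` is hypothetical and not part of the original workflow.

```shell
# Hypothetical helper: print "seqid:start-end<TAB>strand" for every "gene" row
# whose attribute column contains gene "<name>".
# $1 = gene name; remaining arguments = GTF files (stdin is read when none are given).
gtf_regions() {
    gene_name=$1
    shift
    awk -v g="$gene_name" '
        $3 == "gene" && index($0, "gene \"" g "\";") {
            print $1 ":" $4 "-" $5 "\t" $7
        }' "$@"
}

# Example with an inline, shortened GTF row (placeholder identifiers)
printf 'ctg1\tRefSeq\tgene\t43360\t44327\t.\t+\t.\tgene_id "X_1"; gene "yopT";\n' |
    gtf_regions yopT
```

The first column can be passed to `samtools faidx` as the region argument. A `-` in the second column indicates that the extracted sequence still needs to be reverse-complemented.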
    
    samtools faidx Yersinia_pestis_1412.fna NZ_CP006780.1:43360-44327 > temp.fna
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    Yersinia_pestis_1412    ATGAACAGTATTCACGGACACTACCATATTCAACTATCGAATTATTCTGCCGGTGAAAACCTTCAATCAGTACCCTCACCGAAGGGGTGATTGGCGCACACCGAGTGAAAGTGGAAACAGCACTGTCACACTCAAACCTGCAGAAAAAGTTATCAGCCACCATAAAACATAACCAGTCAGGCCGTTCTATGCTGGATAGAAAGTTGACCAGCGACGGCAAAGCTAACCAACGCAGCAGCTTTACCTTCAGTATGATTATGTATCGCATGATACATTTTGTACTCAGCACTCGTGTGCCCGCGGTGAGAGAGTCTGTTGCAAATTACGGAGGTAACATCAATTTCAAGTTTGCTCAGACCAAAGGGGCTTTTCTTCATAAAATAATAAAACATTCAGACACTGCTAGCGGTGTCTGTGAGGCTTTATGTGCACATTGGATCAGGAACCATGCACAAGGCCAAAGCTTATTTGACCAGCTCTATGTTGGCGGGCGTAAGGGGAAATTCCAGATCGATACACTTTACTCAATTAAACAGTTGCAAATAGATGGTTGTAAAGCAGACGTTGATCAAGATGAGGTAACACTAGATTGGTTCAAGAAAAATGGCATATCAGAACGTATGATTGAACGGCATTGCTTACTGCGTCCAGTTGATGTTACTGGTACGACGGAATCAGAAGGGCTGGATCAATTATTAAACGCTATCCTTGATACTCATGGGATAGGTTACGGTTATAAAAAAATACATCTCTCAGGCCAAATGTCAGCCCACGCCATAGCGGCGTATGTCAACGAAAAGAGTGGTGTTACTTTCTTCGATCCCAATTTCGGTGAATTCCACTTTTCTGATAAGGAAAAGTTCCGCAAATGGTTTACTAACTCATTCTGGGGTAATTCTATGTATCATTATCCTCTGGGGGTGGGGCAGCGTTTTAGAGTGTTAACATTTGACTCCAAGGAGGTTTAA
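
The `sed` call above joins header and sequence into a single line, after which the header still has to be replaced by the strain name by hand. As an alternative sketch (not the method used in these notes), awk can drop the header and emit the labelled row in one step. The helper name `fasta_to_row` is hypothetical.

```shell
# Hypothetical helper: turn a FASTA stream into one "label<TAB>sequence" row.
# Header lines are dropped and all sequence lines are concatenated in order.
# $1 = label for the first column (e.g. the strain name); FASTA is read from stdin.
fasta_to_row() {
    awk -v label="$1" '
        !/^>/ { seq = seq $0 }
        END   { printf "%s\t%s\n", label, seq }'
}

# Example with a toy record wrapped over two lines
printf '>ctg1:1-6\nATG\nTAA\n' | fasta_to_row demo_strain
```

With the files from these notes, the call would look like `fasta_to_row Yersinia_pestis_1412 < temp.fna`. Because every header line is dropped, a `temp.fna` holding two appended parts is also merged into one row.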
    
    samtools faidx Yersinia_pestis_1413.fna NZ_CP006761.1:60310-61277 > temp.fna
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    Yersinia_pestis_1413    ATGAACAGTATTCACGGACACTACCATATTCAACTATCGAATTATTCTGCCGGTGAAAACCTTCAATCAGTACCCTCACCGAAGGGGTGATTGGCGCACACCGAGTGAAAGTGGAAACAGCACTGTCACACTCAAACCTGCAGAAAAAGTTATCAGCCACCATAAAACATAACCAGTCAGGCCGTTCTATGCTGGATAGAAAGTTGACCAGCGACGGCAAAGCTAACCAACGCAGCAGCTTTACCTTCAGTATGATTATGTATCGCATGATACATTTTGTACTCAGCACTCGTGTGCCCGCGGTGAGAGAGTCTGTTGCAAATTACGGAGGTAACATCAATTTCAAGTTTGCTCAGACCAAAGGGGCTTTTCTTCATAAAATAATAAAACATTCAGACACTGCTAGCGGTGTCTGTGAGGCTTTATGTGCACATTGGATCAGGAACCATGCACAAGGCCAAAGCTTATTTGACCAGCTCTATGTTGGCGGGCGTAAGGGGAAATTCCAGATCGATACACTTTACTCAATTAAACAGTTGCAAATAGATGGTTGTAAAGCAGACGTTGATCAAGATGAGGTAACACTAGATTGGTTCAAGAAAAATGGCATATCAGAACGTATGATTGAACGGCATTGCTTACTGCGTCCAGTTGATGTTACTGGTACGACGGAATCAGAAGGGCTGGATCAATTATTAAACGCTATCCTTGATACTCATGGGATAGGTTACGGTTATAAAAAAATACATCTCTCAGGCCAAATGTCAGCCCACGCCATAGCGGCGTATGTCAACGAAAAGAGTGGTGTTACTTTCTTCGATCCCAATTTCGGTGAATTCCACTTTTCTGATAAGGAAAAGTTCCGCAAATGGTTTACTAACTCATTCTGGGGTAATTCTATGTATCATTATCCTCTGGGGGTGGGGCAGCGTTTTAGAGTGTTAACATTTGACTCCAAGGAGGTTTAA
    
    samtools faidx Yersinia_pestis_1522.fna NZ_CP006757.1:61673-62640 > temp.fna
    revseq -sequence temp.fna -outseq 61673-62640.rev   # reverse-complement the minus-strand gene
    sed -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' 61673-62640.rev > temp_.fna
    
    Yersinia_pestis_1522    ATGAACAGTATTCACGGACACTACCATATTCAACTATCGAATTATTCTGCCGGTGAAAACCTTCAATCAGTACCCTCACCGAAGGGGTGATTGGCGCACACCGAGTGAAAGTGGAAACAGCACTGTCACACTCAAACCTGCAGAAAAAGTTATCAGCCACCATAAAACATAACCAGTCAGGCCGTTCTATGCTGGATAGAAAGTTGACCAGCGACGGCAAAGCTAACCAACGCAGCAGCTTTACCTTCAGTATGATTATGTATCGCATGATACATTTTGTACTCAGCACTCGTGTGCCCGCGGTGAGAGAGTCTGTTGCAAATTACGGAGGTAACATCAATTTCAAGTTTGCTCAGACCAAAGGGGCTTTTCTTCATAAAATAATAAAACATTCAGACACTGCTAGCGGTGTCTGTGAGGCTTTATGTGCACATTGGATCAGGAACCATGCACAAGGCCAAAGCTTATTTGACCAGCTCTATGTTGGCGGGCGTAAGGGGAAATTCCAGATCGATACACTTTACTCAATTAAACAGTTGCAAATAGATGGTTGTAAAGCAGACGTTGATCAAGATGAGGTAACACTAGATTGGTTCAAGAAAAATGGCATATCAGAACGTATGATTGAACGGCATTGCTTACTGCGTCCAGTTGATGTTACTGGTACGACGGAATCAGAAGGGCTGGATCAATTATTAAACGCTATCCTTGATACTCATGGGATAGGTTACGGTTATAAAAAAATACATCTCTCAGGCCAAATGTCAGCCCACGCCATAGCGGCGTATGTCAACGAAAAGAGTGGTGTTACTTTCTTCGATCCCAATTTCGGTGAATTCCACTTTTCTGATAAGGAAAAGTTCCGCAAATGGTTTACTAACTCATTCTGGGGTAATTCTATGTATCATTATCCTCTGGGGGTGGGGCAGCGTTTTAGAGTGTTAACATTTGACTCCAAGGAGGTTTAA
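
For minus-strand genes these notes rely on EMBOSS `revseq`. Where EMBOSS is not installed, the reverse complement of a sequence that is already on a single line can be sketched with `rev` and `tr`. This is a swapped-in alternative, not the tool used above. The helper name `revcomp` is hypothetical, and IUPAC ambiguity codes other than `N` are left unchanged rather than complemented.

```shell
# Hypothetical helper: reverse-complement single-line DNA sequences read from stdin.
# rev reverses each line; tr then swaps complementary bases (upper and lower case).
revcomp() {
    rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Example: complement of AAGC is TTCG, reversed gives GCTT
printf 'AAGC\n' | revcomp
```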
    
    #grep "yopT" selected_gtf_files/Yersinia_pestis_3067.gtf
    NZ_CP006753.1   RefSeq  gene    43515   44482   .       +       .       gene_id "M482_RS22205"; transcript_id ""; gbkey "Gene"; gene "yopT"; gene_biotype "pseudogene"; locus_tag "M482_RS22205"; old_locus_tag "M482_4297"; pseudo "true"; 
    
    samtools faidx Yersinia_pestis_3067.fna NZ_CP006753.1:43515-44482 > temp.fna
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    Yersinia_pestis_3067    ATGAACAGTATTCACGGACACTACCATATTCAACTATCGAATTATTCTGCCGGTGAAAACCTTCAATCAGTACCCTCACCGAAGGGGTGATTGGCGCACACCGAGTGAAAGTGGAAACAGCACTGTCACACTCAAACCTGCAGAAAAAGTTATCAGCCACCATAAAACATAACCAGTCAGGCCGTTCTATGCTGGATAGAAAGTTGACCAGCGACGGCAAAGCTAACCAACGCAGCAGCTTTACCTTCAGTATGATTATGTATCGCATGATACATTTTGTACTCAGCACTCGTGTGCCCGCGGTGAGAGAGTCTGTTGCAAATTACGGAGGTAACATCAATTTCAAGTTTGCTCAGACCAAAGGGGCTTTTCTTCATAAAATAATAAAACATTCAGACACTGCTAGCGGTGTCTGTGAGGCTTTATGTGCACATTGGATCAGGAACCATGCACAAGGCCAAAGCTTATTTGACCAGCTCTATGTTGGCGGGCGTAAGGGGAAATTCCAGATCGATACACTTTACTCAATTAAACAGTTGCAAATAGATGGTTGTAAAGCAGACGTTGATCAAGATGAGGTAACACTAGATTGGTTCAAGAAAAATGGCATATCAGAACGTATGATTGAACGGCATTGCTTACTGCGTCCAGTTGATGTTACTGGTACGACGGAATCAGAAGGGCTGGATCAATTATTAAACGCTATCCTTGATACTCATGGGATAGGTTACGGTTATAAAAAAATACATCTCTCAGGCCAAATGTCAGCCCACGCCATAGCGGCGTATGTCAACGAAAAGAGTGGTGTTACTTTCTTCGATCCCAATTTCGGTGAATTCCACTTTTCTGATAAGGAAAAGTTCCGCAAATGGTTTACTAACTCATTCTGGGGTAATTCTATGTATCATTATCCTCTGGGGGTGGGGCAGCGTTTTAGAGTGTTAACATTTGACTCCAAGGAGGTTTAA
    
    #grep "yopT" selected_gtf_files/Yersinia_pestis_3770.gtf
    NZ_CP006750.1   RefSeq  gene    18136   19103   .       +       .       gene_id "M483_RS22135"; transcript_id ""; gbkey "Gene"; gene "yopT"; gene_biotype "pseudogene"; locus_tag "M483_RS22135"; old_locus_tag "M483_4264"; pseudo "true"; 
    
    samtools faidx Yersinia_pestis_3770.fna NZ_CP006750.1:18136-19103 > temp.fna
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    Yersinia_pestis_3770    ATGAACAGTATTCACGGACACTACCATATTCAACTATCGAATTATTCTGCCGGTGAAAACCTTCAATCAGTACCCTCACCGAAGGGGTGATTGGCGCACACCGAGTGAAAGTGGAAACAGCACTGTCACACTCAAACCTGCAGAAAAAGTTATCAGCCACCATAAAACATAACCAGTCAGGCCGTTCTATGCTGGATAGAAAGTTGACCAGCGACGGCAAAGCTAACCAACGCAGCAGCTTTACCTTCAGTATGATTATGTATCGCATGATACATTTTGTACTCAGCACTCGTGTGCCCGCGGTGAGAGAGTCTGTTGCAAATTACGGAGGTAACATCAATTTCAAGTTTGCTCAGACCAAAGGGGCTTTTCTTCATAAAATAATAAAACATTCAGACACTGCTAGCGGTGTCTGTGAGGCTTTATGTGCACATTGGATCAGGAACCATGCACAAGGCCAAAGCTTATTTGACCAGCTCTATGTTGGCGGGCGTAAGGGGAAATTCCAGATCGATACACTTTACTCAATTAAACAGTTGCAAATAGATGGTTGTAAAGCAGACGTTGATCAAGATGAGGTAACACTAGATTGGTTCAAGAAAAATGGCATATCAGAACGTATGATTGAACGGCATTGCTTACTGCGTCCAGTTGATGTTACTGGTACGACGGAATCAGAAGGGCTGGATCAATTATTAAACGCTATCCTTGATACTCATGGGATAGGTTACGGTTATAAAAAAATACATCTCTCAGGCCAAATGTCAGCCCACGCCATAGCGGCGTATGTCAACGAAAAGAGTGGTGTTACTTTCTTCGATCCCAATTTCGGTGAATTCCACTTTTCTGATAAGGAAAAGTTCCGCAAATGGTTTACTAACTCATTCTGGGGTAATTCTATGTATCATTATCCTCTGGGGGTGGGGCAGCGTTTTAGAGTGTTAACATTTGACTCCAAGGAGGTTTAA
    
    #grep "yopT" selected_gtf_files/Yersinia_pestis_8787.gtf 
    NZ_CP006747.1   RefSeq  gene    55293   56260   .       +       .       gene_id "M484_RS21915"; transcript_id ""; gbkey "Gene"; gene "yopT"; gene_biotype "pseudogene"; locus_tag "M484_RS21915"; old_locus_tag "M484_4255"; pseudo "true"; 
    
    samtools faidx Yersinia_pestis_8787.fna NZ_CP006747.1:55293-56260 > temp.fna
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    Yersinia_pestis_8787    ATGAACAGTATTCACGGACACTACCATATTCAACTATCGAATTATTCTGCCGGTGAAAACCTTCAATCAGTACCCTCACCGAAGGGGTGATTGGCGCACACCGAGTGAAAGTGGAAACAGCACTGTCACACTCAAACCTGCAGAAAAAGTTATCAGCCACCATAAAACATAACCAGTCAGGCCGTTCTATGCTGGATAGAAAGTTGACCAGCGACGGCAAAGCTAACCAACGCAGCAGCTTTACCTTCAGTATGATTATGTATCGCATGATACATTTTGTACTCAGCACTCGTGTGCCCGCGGTGAGAGAGTCTGTTGCAAATTACGGAGGTAACATCAATTTCAAGTTTGCTCAGACCAAAGGGGCTTTTCTTCATAAAATAATAAAACATTCAGACACTGCTAGCGGTGTCTGTGAGGCTTTATGTGCACATTGGATCAGGAACCATGCACAAGGCCAAAGCTTATTTGACCAGCTCTATGTTGGCGGGCGTAAGGGGAAATTCCAGATCGATACACTTTACTCAATTAAACAGTTGCAAATAGATGGTTGTAAAGCAGACGTTGATCAAGATGAGGTAACACTAGATTGGTTCAAGAAAAATGGCATATCAGAACGTATGATTGAACGGCATTGCTTACTGCGTCCAGTTGATGTTACTGGTACGACGGAATCAGAAGGGCTGGATCAATTATTAAACGCTATCCTTGATACTCATGGGATAGGTTACGGTTATAAAAAAATACATCTCTCAGGCCAAATGTCAGCCCACGCCATAGCGGCGTATGTCAACGAAAAGAGTGGTGTTACTTTCTTCGATCCCAATTTCGGTGAATTCCACTTTTCTGATAAGGAAAAGTTCCGCAAATGGTTTACTAACTCATTCTGGGGTAATTCTATGTATCATTATCCTCTGGGGGTGGGGCAGCGTTTTAGAGTGTTAACATTTGACTCCAAGGAGGTTTAA
    
    #grep "yopT" selected_gtf_files/Yersinia_pestis_Pestoides_F.gtf
    NC_009377.1     RefSeq  gene    48563   49530   .       -       .       gene_id "YPDSF_RS23435"; transcript_id ""; gbkey "Gene"; gene "yopT"; gene_biotype "pseudogene"; locus_tag "YPDSF_RS23435"; old_locus_tag "YPDSF_4001"; pseudo "true"; 
    
    samtools faidx Yersinia_pestis_Pestoides_F.fna NC_009377.1:48563-49530 > temp.fna
    revseq -sequence temp.fna -outseq 48563-49530.rev   # reverse-complement the minus-strand gene
    sed -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' 48563-49530.rev > temp_.fna
    
    Yersinia_pestis_Pestoides_F ATGAACAGTATTCACGGACACTACCATATTCAACTATCGAATTATTCTGCCGGTGAAAACCTTCAATCAGTACCCTCACCGAAGGGGTGATTGGCGCACACCGAGTGAAAGTGGAAACAGCACTGTCACACTCAAACCTGCAGAAAAAGTTATCAGCCACCATAAAACATAACCAGTCAGGCCGTTCTATGCTGGATAGAAAGTTGACCAGCGACGGCAAAGCTAACCAACGCAGCAGCTTTACCTTCAGTATGATTATGTATCGCATGATACATTTTGTACTCAGCACTCGTGTGCCCGCGGTGAGAGAGTCTGTTGCAAATTACGGAGGTAACATCAATTTCAAGTTTGCTCAGACCAAAGGGGCTTTTCTTCATAAAATAATAAAACATTCAGACACTGCTAGCGGTGTCTGTGAGGCTTTATGTGCACATTGGATCAGGAACCATGCACAAGGCCAAAGCTTATTTGACCAGCTCTATGTTGGCGGGCGTAAGGGGAAATTCCAGATCGATACACTTTACTCAATTAAACAGTTGCAAATAGATGGTTGTAAAGCAGACGTTGATCAAGATGAGGTAACACTAGATTGGTTCAAGAAAAATGGCATATCAGAACGTATGATTGAACGGCATTGCTTACTGCGTCCAGTTGATGTTACTGGTACGACGGAATCAGAAGGGCTGGATCAATTATTAAACGCTATCCTTGATACTCATGGGATAGGTTACGGTTATAAAAAAATACATCTCTCAGGCCAAATGTCAGCCCACGCCATAGCGGCGTATGTCAACGAAAAGAGTGGTGTTACTTTCTTCGATCCCAATTTCGGTGAATTCCACTTTTCTGATAAGGAAAAGTTCCGCAAATGGTTTACTAACTCATTCTGGGGTAATTCTATGTATCATTATCCTCTGGGGGTGGGGCAGCGTTTTAGAGTGTTAACATTTGACTCCAAGGAGGTTTAA
    
    #grep "yopT" selected_gtf_files/Yersinia_pestis_Pestoides_F_bis.gtf
    NZ_CP009713.1   RefSeq  gene    53246   54213   .       -       .       gene_id "BZ18_RS22165"; transcript_id ""; gbkey "Gene"; gene "yopT"; gene_biotype "pseudogene"; locus_tag "BZ18_RS22165"; old_locus_tag "BZ18_4298"; pseudo "true";
    
    samtools faidx Yersinia_pestis_Pestoides_F_bis.fna NZ_CP009713.1:53246-54213 > temp.fna
    revseq -sequence temp.fna -outseq 53246-54213.rev   # reverse-complement the minus-strand gene
    sed -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' 53246-54213.rev > temp_.fna
    
    Yersinia_pestis_Pestoides_F_bis ATGAACAGTATTCACGGACACTACCATATTCAACTATCGAATTATTCTGCCGGTGAAAACCTTCAATCAGTACCCTCACCGAAGGGGTGATTGGCGCACACCGAGTGAAAGTGGAAACAGCACTGTCACACTCAAACCTGCAGAAAAAGTTATCAGCCACCATAAAACATAACCAGTCAGGCCGTTCTATGCTGGATAGAAAGTTGACCAGCGACGGCAAAGCTAACCAACGCAGCAGCTTTACCTTCAGTATGATTATGTATCGCATGATACATTTTGTACTCAGCACTCGTGTGCCCGCGGTGAGAGAGTCTGTTGCAAATTACGGAGGTAACATCAATTTCAAGTTTGCTCAGACCAAAGGGGCTTTTCTTCATAAAATAATAAAACATTCAGACACTGCTAGCGGTGTCTGTGAGGCTTTATGTGCACATTGGATCAGGAACCATGCACAAGGCCAAAGCTTATTTGACCAGCTCTATGTTGGCGGGCGTAAGGGGAAATTCCAGATCGATACACTTTACTCAATTAAACAGTTGCAAATAGATGGTTGTAAAGCAGACGTTGATCAAGATGAGGTAACACTAGATTGGTTCAAGAAAAATGGCATATCAGAACGTATGATTGAACGGCATTGCTTACTGCGTCCAGTTGATGTTACTGGTACGACGGAATCAGAAGGGCTGGATCAATTATTAAACGCTATCCTTGATACTCATGGGATAGGTTACGGTTATAAAAAAATACATCTCTCAGGCCAAATGTCAGCCCACGCCATAGCGGCGTATGTCAACGAAAAGAGTGGTGTTACTTTCTTCGATCCCAATTTCGGTGAATTCCACTTTTCTGATAAGGAAAAGTTCCGCAAATGGTTTACTAACTCATTCTGGGGTAATTCTATGTATCATTATCCTCTGGGGGTGGGGCAGCGTTTTAGAGTGTTAACATTTGACTCCAAGGAGGTTTAA
    
    #grep "yopT" selected_gtf_files/Yersinia_pestis_Pestoides_G.gtf
    NZ_CP010246.1   RefSeq  gene    1551    2518    .       +       .       gene_id "CH43_RS22165"; transcript_id ""; gbkey "Gene"; gene "yopT"; gene_biotype "pseudogene"; locus_tag "CH43_RS22165"; old_locus_tag "CH43_4244"; pseudo "true"; 
    
    samtools faidx Yersinia_pestis_Pestoides_G.fna NZ_CP010246.1:1551-2518 > temp.fna
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    Yersinia_pestis_Pestoides_G ATGAACAGTATTCACGGACACTACCATATTCAACTATCGAATTATTCTGCCGGTGAAAACCTTCAATCAGTACCCTCACCGAAGGGGTGATTGGCGCACACCGAGTGAAAGTGGAAACAGCACTGTCACACTCAAACCTGCAGAAAAAGTTATCAGCCACCATAAAACATAACCAGTCAGGCCGTTCTATGCTGGATAGAAAGTTGACCAGCGACGGCAAAGCTAACCAACGCAGCAGCTTTACCTTCAGTATGATTATGTATCGCATGATACATTTTGTACTCAGCACTCGTGTGCCCGCGGTGAGAGAGTCTGTTGCAAATTACGGAGGTAACATCAATTTCAAGTTTGCTCAGACCAAAGGGGCTTTTCTTCATAAAATAATAAAACATTCAGACACTGCTAGCGGTGTCTGTGAGGCTTTATGTGCACATTGGATCAGGAACCATGCACAAGGCCAAAGCTTATTTGACCAGCTCTATGTTGGCGGGCGTAAGGGGAAATTCCAGATCGATACACTTTACTCAATTAAACAGTTGCAAATAGATGGTTGTAAAGCAGACGTTGATCAAGATGAGGTAACACTAGATTGGTTCAAGAAAAATGGCATATCAGAACGTATGATTGAACGGCATTGCTTACTGCGTCCAGTTGATGTTACTGGTACGACGGAATCAGAAGGGCTGGATCAATTATTAAACGCTATCCTTGATACTCATGGGATAGGTTACGGTTATAAAAAAATACATCTCTCAGGCCAAATGTCAGCCCACGCCATAGCGGCGTATGTCAACGAAAAGAGTGGTGTTACTTTCTTCGATCCCAATTTCGGTGAATTCCACTTTTCTGATAAGGAAAAGTTCCGCAAATGGTTTACTAACTCATTCTGGGGTAATTCTATGTATCATTATCCTCTGGGGGTGGGGCAGCGTTTTAGAGTGTTAACATTTGACTCCAAGGAGGTTTAA
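
Once the rows for one gene are collected, it can be useful to check how many distinct sequences they contain. The following is a minimal sketch, assuming the rows are stored as name and sequence separated by whitespace. The helper name `count_distinct_seqs` is hypothetical.

```shell
# Hypothetical helper: count distinct sequences in a "name<whitespace>sequence" table.
# Arguments = table files (stdin is read when none are given).
count_distinct_seqs() {
    awk 'NF >= 2 { print $2 }' "$@" | sort -u | awk 'END { print NR }'
}

# Example: three toy rows, two of which share the same sequence
printf 'a\tATGTAA\nb\tATGTAA\nc\tATGTGA\n' | count_distinct_seqs
```

A result of 1 means all strains in the table carry an identical sequence for that gene.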
    
    #------------------------------- yopE (+3) -------------------------------
    #grep "yopE" selected_gtf_files/Yersinia_pestis_1522.gtf 
    NZ_CP006757.1   RefSeq  gene    70902   71507   .       -       .       gene_id "M481_RS24690"; transcript_id ""; gbkey "Gene"; gene "yopE"; gene_biotype "pseudogene"; locus_tag "M481_RS24690"; old_locus_tag "M481_4336"; part "2"; pseudo "true"; 
    NZ_CP006757.1   RefSeq  gene    1       53      .       -       .       gene_id "M481_RS24690"; transcript_id ""; gbkey "Gene"; gene "yopE"; gene_biotype "pseudogene"; locus_tag "M481_RS24690"; old_locus_tag "M481_4336"; part "1"; pseudo "true"; 
    
    samtools faidx Yersinia_pestis_1522.fna NZ_CP006757.1:70902-71507 > temp.fna
    samtools faidx Yersinia_pestis_1522.fna NZ_CP006757.1:1-53 >> temp.fna
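    # Manually delete the second FASTA header line (">...") in temp.fna so that both parts form one record
    revseq -sequence temp.fna -outseq 70902-71507.rev   # reverse-complement the merged minus-strand gene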
    sed -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' 70902-71507.rev > temp_.fna
    
    Yersinia_pestis_1522    ATGAAAATATCATCATTTATTTCTACATCACTGCCCCTGCCGACATCTGTGTCGGATCTAGCAGCGTAGGAGAAATGTCTGGGCGCTCAGTCTCACAGCAAACAAGTGATCAATATGCAAACAATCTGGCCGGGCGCACTGAAAGCCCTCAGGGTTCCAGCTTAGCCAGCCGTATCATTGAGAGGTTATCATCAGTGGCCCACTCTGTGATTGGGTTTATCCAACGCATGTTCTCGGAGGGGAGCCATAAACCGGTGGTGACACCAGCACCCACACCTGCACAAATGCCAAGTCCTACGTCTTTCAGTGACAGTATCAAGCAACTTGCTGCTGAGACGCTGCCAAAATACATGCAGCAGTTGAATAGCTTGGATGCAGAGATGCTGCAGAAAAATCATGATCAGTTCGCTACGGGCAGCGGCCCTCTTCGTGGCAGTATCACTCAATGCCAAGGGCTGATGCAGTTTTGTGGTGGGGAATTGCAAGCTGAGGCCAGTGCCATCTTAAACACGCCTGTTTGTGGTATTCCCTTCTCGCAGTGGGGAACTATTGGTGGGGCGGCCAGCGCGTACGTCGCCAGTGGCGTTGATCTAACGCAGGCAGCAAATGAGATCAAAGGGCTGGCGCAACAGATGCAGAAATTACTGTCATTGATGTGA
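
Genes that wrap around the origin of the circular plasmid are extracted in two parts, and the second header is then deleted by hand. That manual step can be sketched as a small filter. The helper name `merge_fasta_records` is hypothetical.

```shell
# Hypothetical helper: merge all records of a FASTA stream into a single record
# by keeping the first header line and dropping every later one.
# Arguments = FASTA files (stdin is read when none are given).
merge_fasta_records() {
    awk '/^>/ { if (seen++) next } { print }' "$@"
}

# Example: end-of-replicon part followed by the part starting at position 1
printf '>ctg1:5-6\nAC\n>ctg1:1-3\nGTT\n' | merge_fasta_records
```

For a minus-strand gene, the merged record is reverse-complemented as a whole afterwards, as done with `revseq` in these notes.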
    
    #grep "yopE" selected_gtf_files/Yersinia_pestis_Nicholisk_41.gtf
    NZ_CP009990.1   RefSeq  gene    67916   68552   .       +       .       gene_id "CH63_RS00620"; transcript_id ""; gbkey "Gene"; gene "yopE"; gene_biotype "protein_coding"; locus_tag "CH63_RS00620"; part "1"; 
    NZ_CP009990.1   RefSeq  gene    1       23      .       +       .       gene_id "CH63_RS00620"; transcript_id ""; gbkey "Gene"; gene "yopE"; gene_biotype "protein_coding"; locus_tag "CH63_RS00620"; part "2"; 
    
    samtools faidx Yersinia_pestis_Nicholisk_41.fna NZ_CP009990.1:67916-68552 > temp.fna
    samtools faidx Yersinia_pestis_Nicholisk_41.fna NZ_CP009990.1:1-23 >> temp.fna
    # Manually delete the second FASTA header line (">...") in temp.fna so that both parts form one record
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    Yersinia_pestis_Nicholisk_41    ATGAAAATATCATCATTTATTTCTACATCACTGCCCCTGCCGACATCTGTGTCAGGATCTAGCAGCGTAGGAGAAATGTCTGGGCGCTCAGTCTCACAGCAAACAAGTGATCAATATGCAAACAATCTGGCCGGGCGCACTGAAAGCCCTCAGGGTTCCAGCTTAGCCAGCCGTATCATTGAGAGGTTATCATCAGTGGCCCACTCTGTGATTGGGTTTATCCAACGCATGTTCTCGGAGGGGAGCCATAAACCGGTGGTGACACCGGCACCCACACCTGCACAAATGCCAAGTCCTACGTCTTTCAGTGACAGTATCAAGCAACTTGCTGCTGAGACGCTGCCAAAATACATGCAGCAGTTGAATAGCTTGGATGCAGAGATGCTGCAGAAAAATCATGATCAGTTCGCTACGGGCAGCGGCCCTCTTCGTGGCAGTATCACTCAATGCCAAGGGCTGATGCAGTTTTGTGGTGGGGAATTGCAAGCTGAGGCCAGTGCCATCTTAAACACGCCTGTTTGTGGTATTCCCTTCTCGCAGTGGGGAACTATTGGTGGGGCGGCCAGCGCGTACGTCGCCAGTGGCGTTGATCTAACGCAGGCAGCAAATGAGATCAAAGGGCTGGCGCAACAGATGCAGAAATTACTGTCATTGATGTGA
    
    #grep "yopE" selected_gtf_files/Yersinia_pseudotuberculosis_FDAARGOS_581.gtf
    NZ_CP033712.1   RefSeq  gene    69663   70035   .       +       .       gene_id "EGX47_RS00005"; transcript_id ""; gbkey "Gene"; gene "yopE"; gene_biotype "protein_coding"; locus_tag "EGX47_RS00005"; old_locus_tag "EGX47_00005"; part "1"; 
    NZ_CP033712.1   RefSeq  gene    1       287     .       +       .       gene_id "EGX47_RS00005"; transcript_id ""; gbkey "Gene"; gene "yopE"; gene_biotype "protein_coding"; locus_tag "EGX47_RS00005"; old_locus_tag "EGX47_00005"; part "2"; 
    
    samtools faidx Yersinia_pseudotuberculosis_FDAARGOS_581.fna NZ_CP033712.1:69663-70035 > temp.fna
    samtools faidx Yersinia_pseudotuberculosis_FDAARGOS_581.fna NZ_CP033712.1:1-287 >> temp.fna
    # Manually delete the second FASTA header line (">...") in temp.fna so that both parts form one record
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    Yersinia_pseudotuberculosis_FDAARGOS_581    ATGAAAATATCATCATTTATTTCTACATCACTGCCCCTGCCGACATCTGTGTCAGGATCTAGCAGCGTAGGAGAAATGTCTGGGCGCTCAGTCTCACAGCAAACAAGTGATCAATATGCAAACAATCTGGCCGGGCGCACTGAAAGCCCTCAGGGTTCCAGCTTAGCCAGCCGTATCATTGAGAGGTTATCATCAGTGGCCCACTCTGTGATTGGGTTTATCCAACGCATGTTCTCGGAGGGGAGCCATAAACCGGTGGTGACACCAGCACCCACACCTGCACAAATGCCAAGTCCTACGTCTTTCAGTGACAGTATCAAGCAACTTGCTGCTGAGACGCTGCCAAAATACATGCAGCAGTTGAATAGCTTGGATGCAGAGATGCTGCAGAAAAATCATGATCAGTTCGCTACGGGCAGCGGCCCTCTTCGTGGCAGTATCACTCAATGCCAAGGGCTGATGCAGTTTTGTGGTGGGGAATTGCAAGCTGAGGCCAGTGCCATCTTAAACACGCCTGTTTGTGGTATTCCCTTCTCGCAGTGGGGAACTATTGGTGGGGCGGCCAGCGCGTACGTCGCCAGTGGCGTTGATCTAACGCAGGCAGCAAATGAGATCAAAGGGCTGGCGCAACAGATGCAGAAATTACTGTCATTGATGTGA
    
    #------------------------------- yopD (+2) -------------------------------
    #grep "yopD" selected_gtf_files/Yersinia_enterocolitica_YE165.gtf
    NZ_CP016933.1   RefSeq  gene    74497   74497   .       +       .       gene_id "BB936_RS22270"; transcript_id ""; gbkey "Gene"; gene "yopD"; gene_biotype "protein_coding"; locus_tag "BB936_RS22270"; old_locus_tag "BB936_22265"; part "1"; 
    NZ_CP016933.1   RefSeq  gene    1       920     .       +       .       gene_id "BB936_RS22270"; transcript_id ""; gbkey "Gene"; gene "yopD"; gene_biotype "protein_coding"; locus_tag "BB936_RS22270"; old_locus_tag "BB936_22265"; part "2"; 
    
    samtools faidx Yersinia_enterocolitica_YE165.fna NZ_CP016933.1:74497-74497 > temp.fna
    samtools faidx Yersinia_enterocolitica_YE165.fna NZ_CP016933.1:1-920 >> temp.fna
    # Manually delete the second FASTA header line (">...") in temp.fna so that both parts form one record
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    Yersinia_enterocolitica_YE165   ATGACAATAAATATCAAGACAGACAGCCCAATTATCACGACCGGTTCACAGCTTGATGCCATCACTACAGAGACAGTCGGGCAAAGCGGTGAGGTTAAAAAAACAGAAGACACCCGTCATGAAGCACAAGCAATAAAGAGTAGCGAGGCAAGCTTATCTCGGTCACAGGTGCCTGAATTGATCAAACCGAGTCAGGGAATCAATGTTGCATTACTGAGTAAAAGCCAGGGAGATCTTAATGGTACTTTAAGTATCTTGTTGTTGCTGTTGGAACTGGCACGTAAAGCGCGAGAAATGGGTTTGCAACAAAGGGATATAGAAAATAAAGCTACTATTTCTGCCCAAAAGGAGCAGGTAGCGGAGATGGTCAGCGGTGCAAAACTGATGATCGCCATGGCGGTGGTGTCTGGCATCATGGCTGCTACTTCTACGGTTGCTAGTGCTTTTTCTATAGCGAAAGAGGTGAAAATAGTTAAACAGGAACAAATTCTAAACAGTAACATTGCCGGCCGTGATCAACTTATTGATACAAAAATGCAGCAAATGAGTAACGCTGGTGATAAAGCGGTAAGCAGAGAGGATATCGGGAGAATATGGAAACCAGAGCAGGTAGCGGATCAAAATAAGCTGGCATTATTGGATAAAGAATTCAGAATGACCGACTCAAAAGCCAATGCGTTTAATGCCGCAACGCAGCCGTTAGGACAAATGGCAAACAGTGCGATTCAAGTTCATCAAGGGTATTCTCAAGCCGAGGTCAAAGAAAAAGAAGTCAATGCAAGTATTGCTGCCAACGAGAAGCAAAAAGCCGAAGAGGCGATGAACTATAATGATAACTTTATGAAAGATGTCCTGCGCTTGATTGAACAATATGTTAGCAGTCATACTCACGCCATGAAAGCCGCTTTTGGTGTTGTCTGA
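
A quick sanity check for a two-part extraction is to add up the lengths implied by the coordinates and compare the total with the length of the joined sequence. The following is a minimal sketch; the helper name `regions_length` is hypothetical.

```shell
# Hypothetical helper: sum the lengths of inclusive "start-end" coordinate ranges.
# Each argument is one range, e.g. 74497-74497 or 1-920.
regions_length() {
    total=0
    for r in "$@"; do
        start=${r%-*}   # text before the dash
        end=${r#*-}     # text after the dash
        total=$((total + end - start + 1))
    done
    echo "$total"
}

# Example with the two yopD parts listed above: 1 + 920 bases
len=$(regions_length 74497-74497 1-920)
echo "$len"
# For a protein-coding gene the total is expected to be a multiple of three
[ $((len % 3)) -eq 0 ] && echo "length is a multiple of 3"
```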
    
    #grep "yopD" selected_gtf_files/Yersinia_pseudotuberculosis_IP32953_bis.gtf
    NZ_CP009711.1   RefSeq  gene    68202   68525   .       +       .       gene_id "BZ17_RS00160"; transcript_id ""; db_xref "GeneID:66841050"; gbkey "Gene"; gene "yopD"; gene_biotype "protein_coding"; locus_tag "BZ17_RS00160"; part "1"; 
    NZ_CP009711.1   RefSeq  gene    1       597     .       +       .       gene_id "BZ17_RS00160"; transcript_id ""; db_xref "GeneID:66841050"; gbkey "Gene"; gene "yopD"; gene_biotype "protein_coding"; locus_tag "BZ17_RS00160"; part "2"; 
    
    samtools faidx Yersinia_pseudotuberculosis_IP32953_bis.fna NZ_CP009711.1:68202-68525 > temp.fna
    samtools faidx Yersinia_pseudotuberculosis_IP32953_bis.fna NZ_CP009711.1:1-597 >> temp.fna
    # Manually delete the second FASTA header line (">...") in temp.fna so that both parts form one record
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    Yersinia_pseudotuberculosis_IP32953_bis ATGACAATAAATATCAAGACAGACAGCCCAATTATCACGACCGGTTCACAGCTTGATGCCATCACTACAGAGACAGTCAAGCAAAGCGGTGAGATTAAAAAAACAGAAGACACCCGTCATGAAGCACAAGCAATAAAGAGTAGCGAGGCAAGCTTATCTCGGTCACAGGTGCCAGAATTGATCAAACCGAGCCAGGGAATCAATGTTGCATTACTGAGTAAAAGCCAGGGTGATCTTAATGGTACTTTAAGTATCTTGTTGTTGCTGTTGGAACTGGCACGTAAAGCGCGAGAAATGGGTTTGCAACAAAGGGATATAGAAAATAAAGCTACTATTACTGCCCAAAAGGAGCAGGTAGCGGAGATGGTCAGCGGTGCAAAACTGATGATCGCCATGGCGGTGGTGTCTGGCATCATGGCTGCTACTTCTACGGTTGCTAGTGCTTTTTCTATAGCGAAAGAGGTGAAAATAGTTAAACAGGAACAAATTCTAAACAGTAATATTGCTGGCCGCGAACAACTTATTGATACAAAAATGCAGCAAATGAGTAACATTGGTGATAAAGCGGTAAGCAGAGAGGATATCGGGAGAATATGGAAACCAGAGCAGGTAGCGGATCAAAATAAGCTGGCATTATTGGATAAAGAATTCAGAATGACCGACTCAAAAGCCAATGCGTTTAATGCCGCAACGCAGCCGTTAGGACAAATGGCAAACAGTGCGATTCAAGTTCATCAAGGGTATTCTCAAGCCGAGGTCAAAGAGAAAGAAGTCAATGCAAGTATTGCTGCCAACGAGAAGCAAAAAGCCGAAGAGGCGATGAACTATAATGATAACTTTATGAAAGATGTCCTGCGCTTGATTGAACAATATGTTAGCAGTCATACTCACGCCATGAAAGCCGCTTTTGGTGTTGTCTGA
    
    #------------------------------- yopM (+2) -------------------------------
    #grep "yopM" selected_gtf_files/Yersinia_pestis_FDAARGOS_602.gtf
    NZ_CP033695.1   RefSeq  gene    69663   70174   .       +       .       gene_id "EGX42_RS00660"; transcript_id ""; gbkey "Gene"; gene "yopM"; gene_biotype "protein_coding"; locus_tag "EGX42_RS00660"; old_locus_tag "EGX42_00655"; part "1"; 
    NZ_CP033695.1   RefSeq  gene    1       592     .       +       .       gene_id "EGX42_RS00660"; transcript_id ""; gbkey "Gene"; gene "yopM"; gene_biotype "protein_coding"; locus_tag "EGX42_RS00660"; old_locus_tag "EGX42_00655"; part "2"; 
    
    samtools faidx Yersinia_pestis_FDAARGOS_602.fna NZ_CP033695.1:69663-70174 > temp.fna
    samtools faidx Yersinia_pestis_FDAARGOS_602.fna NZ_CP033695.1:1-592 >> temp.fna
    # Manually delete the second FASTA header line (">...") in temp.fna so that both parts form one record
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    Yersinia_pestis_FDAARGOS_602    ATGTTCATAAATCCAAGAAATGTATCTAATACTTTTTTGCAAGAACCATTACGTCATTCTTCTAATTTAACTGAGATGCCGGTTGAGGCAGAAAATGTTAAATCTAAGACTGAATATTATAATGCATGGTCGGAATGGGAACGAAATGCCCCTCCGGGGAATGGTGAACAGAGGGAAATGGCGGTTTCAAGGTTACGAGATTGCCTGGACCGACAAGCCCATGAGCTAGAACTAAATAATCTGGGGCTGAGTTCTTTGCCGGAATTACCTCCGCATTTAGAGAGTTTAGTGGCGTCATGTAATTCTCTTACAGAATTACCGGAATTACCGCAGAGCCTGAAATCACTTCTAGTTGATAATAACAATCTGAAGGCATTATCCGATTTACCACCTTTACTGGAATATTTAGGTGTCTCTAATAATCAGCTGGAAAAATTGCCAGAGTTGCAAAACTCGTCCTTCTTGAAAATTATTGATGTTGATAACAATTCACTGAAAAAACTACCTGATTTACCTCCTTCACTGGAGTTTATTGCTGCTGGTAATAATCAGCTGGAAGAATTGCCAGAGTTGCAAAACTTGCCCTTCTTGACTACGATTTATGCTGATAACAATTTACTGAAAACATTACCCGATTTACCCCCTTCCCTGGAAGCACTTAATGTCAGAGATAATTATTTAACTGATCTGCCAGAATTACCGCAGAGTTTAACCTTCTTAGATGTTTCTGAAAATATTTTTTCTGGATTATCGGAATTGCCACCAAACTTGTATTATCTCAATGCATCCAGCAATGAAATAAGATCCTTATGCGATTTACCCCCTTCACTGGAAGAACTTAATGTCAGTAATAATAAGTTGATCGAACTGCCAGCGTTACCTCCACGCTTAGAACGTTTAATCGCTTCATTTAATCATCTTGCTGAAGTACCTGAATTGCCGCAAAACCTGAAACAGCTCCACGTAGAGTACAACCCTCTGAGAGAGTTTCCCGATATACCTGAGTCAGTGGAAGATCTTCGGATGAACTCTGAACGTGTAGTTGATCCATATGAATTTGCTCATGAGACTACAGACAAACTTGAAGATGATGTATTTGAGTAG
    
    #grep "yopM" selected_gtf_files/Yersinia_pseudotuberculosis_PB1+_bis.gtf
    NZ_CP009779.1   RefSeq  gene    69708   69812   .       +       .       gene_id "BZ16_RS00005"; transcript_id ""; gbkey "Gene"; gene "yopM"; gene_biotype "protein_coding"; locus_tag "BZ16_RS00005"; old_locus_tag "BZ16_4135"; part "1"; 
    NZ_CP009779.1   RefSeq  gene    1       1485    .       +       .       gene_id "BZ16_RS00005"; transcript_id ""; gbkey "Gene"; gene "yopM"; gene_biotype "protein_coding"; locus_tag "BZ16_RS00005"; old_locus_tag "BZ16_4135"; part "2"; 
    
    samtools faidx Yersinia_pseudotuberculosis_PB1+_bis.fna NZ_CP009779.1:69708-69812 > temp.fna
    samtools faidx Yersinia_pseudotuberculosis_PB1+_bis.fna NZ_CP009779.1:1-1485 >> temp.fna
    # Manually delete the second FASTA header line (">...") in temp.fna so that both parts form one record
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    Yersinia_pseudotuberculosis_PB1+_bis    ATGTTCATAAATCCAAGAAATGTATCTAATACTTTTTTGCAAGAACCATTACGTCATTCTTCTAATTTAACTGAGATGCCGGTTGAGGCAGAAAATGTTAAATCTAAGACTGAATATTATAATGCATGGTCGGAATGGGAACGAAATGCCCCTCCGGGGAATGGTGAACAGAGGGAAATGGCGGTTTCAAGGTTACGAGATTGCCTGGACCGACAAGCCCATGAGCTAGAACTAAATAATCTGGGGCTGAGTTCTTTGCCGGAATTACCTCCGCATTTAGAGAGTTTAGTGGCGTCATGTAATTCTCTTACAGAATTACCGGAATTGCCGCAGAGCCTGAAATCACTTCAAGTTGAAAATAACAATCTGAAGGCATTACCCGATTTACCCCCTTCCCTGAAAAAACTTCATGTCAGAGAAAATGATTTAACTGATCTGCCAGAATTACCGCAGAGCCTGGAATCACTTCGAGTTGATAATAACAATCTGAAGGCATTATCCGATTTACCTCCTTCACTGGAATATCTTACTGCTAGTAGTAATAAGCTGGAAGAATTGCCAGAGTTGCAAAACTTGCCCTTCTTGGCTGCGATTTATGCTGATAACAATTTACTGGAAACATTACCCGATTTACCCCCTTCCCTGAAAAAACTTCATGTCAGAGAAAATGATTTAACTGATCTGCCAGAATTACCGCAGAGCCTGGAATCACTTCAAGTTGATAATAACAATCTGAAGGCATTATCCGATTTACCTCCTTCACTGGAATATCTTACTGCTAGTAGTAATAAGCTGGAAGAATTGCCAGAGTTGCAAAACTTGCCCTTCTTGGCTGCGATTTATGCTGATAACAATTTACTGGAAACATTACCCGATTTACCCCCACATTTAGAGATTTTAGTGGCGTCATATAATTCTCTTACTGAATTACCGGAATTGCCGCAGAGCCTGAAATCACTTCGAGTTGATAATAACAATCTGAAGGCATTATCCGATTTACCTCCTTCACTGGAATATCTTACTGCTAGTAGTAATAAGCTGGAAGAATTACCAGAGTTGCAAAACTTGCCCTTCTTGGCTGCGATTTATGCTGATAACAATTTACTGGAAACATTACCCGATTTACCCCCTTCCCTGAAAAAACTTCATGTCAGAGAAAATGATTTAACTGATCTGCCAGAATTACCGCAGAGTTTAACCTTCTTAGATGTTTCTGATAATAATATTTCTGGATTATCGGAATTGCCACCAAACTTGTATTATCTCGATGCATCCAGCAATGAAATAAGATCCTTATGCGATTTACCTCCTTCACTGGTAGACCTTAATGTCAAAAGTAATCAGTTGAGCGAACTGCCAGCGTTACCTCCACACTTAGAACGTTTAATCGCTTCATTTAATTATCTTGCTGAAGTACCTGAATTGCCGCAAAACCTGAAACAGCTCCACGTAGAGCAAAACGCTCTGAGAGAGTTTCCCGATATACCTGAGTCATTGGAAGAGCTTGAGATGGACTCTGAACGTGTAGTTGATCCATATGAATTTGCTCATGAGACTACAGACAAACTTGAAGATGATGTATTTGAGTAG
    
    #------------------------------- yopO (+9) -------------------------------
    #grep "yopO" selected_gtf_files/Yersinia_enterocolitica_YE165.gtf
    NZ_CP016933.1   RefSeq  gene    11705   13893   .       -       .       gene_id "BB936_RS22335"; transcript_id ""; gbkey "Gene"; gene "yopO"; gene_biotype "pseudogene"; gene_synonym "ypkA"; locus_tag "BB936_RS22335"; old_locus_tag "BB936_22330"; pseudo "true"; 
    
    samtools faidx Yersinia_enterocolitica_YE165.fna NZ_CP016933.1:11705-13893 > temp.fna
    revseq -sequence temp.fna -outseq 11705-13893.rev   # reverse-complement the minus-strand gene
    sed -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' 11705-13893.rev > temp_.fna
    
    Yersinia_enterocolitica_YE165   ATGAAAATCATGGGAACTATGCCACCGTCGATCTCCCTCGCTAAAGCTCATGAGCGCATCAGCCAACATTGGCAAAATCCTGTCGGTGAGCTCAATATCGGAGGAAAACGGTATAGAATTATCGATAATCAAGTGCTGCGCTTGAACCCCCACAGTGGTTTTTCTCTCTTTCGAGAAGGGGTTGGTAAGATCTTTTCGGGGAAGATGTTTAACTTTTCAATTGCTCGTAACCTTACTGAGACACTCCATGCAGCCCAGAAAACGACTTCGCAGGAGCTAAGGTCTGATATCCCCAATGTTCTCAGTAATCTCTTTGGAGCCAAGCCACAGACCGAACTGCCGCTGGGTTGGAAAGGGAAGCCTTTGTCAGGAGCTCCGGATCTTGAAGGGATGCGAGTGGCTGAAACCGATAAGTTTGCCGAGGGCGAAAGCCATATTAGTATAATAGAAACTAAGGATAATCAGCGGTTGGTGGCTAAGATTGAACGCTCCATTGCCGAGGGGCATTTGTTCGCAGAACTGGAGGCTTATAAACACATCTATAAAACCGCGGGCAAACATCCTAATCTTGCCAATGTCCATGGCATGGCTGTGGTGCCATACGGTAACCGTAAGGAGGAAGCATTGCTGATGGATGAGGTGGATGGTTGGCGTTGTTCTGACACACTAAGAAGCCTCGCCGATAGCTGGAAGCAAGGAAAGATCAATAGTGAAGCCTACTGGGGAACGATCAAGTTTATTGCCCATCGGCTATTAGATGTAACCAATCACCTTGCCAAGGCAGGGATAGTACATAACGATATCAAACCCGGTAATGTGGTATTTGACCGCGCTAGCGGAGAGCCCGTTGTCATTGATCTAGGATTACACTCTCGTTCAGGGGAACAACCTAAGGGGTTTACAGAATCCTTCAAAGCGCCGGAGCTTGGAGTAGGAAACCTAGGCGCATCAGAAAAGAGCGATGTTTTTCTCGTAGTTTCAACCCTTCTACATGGTATCGAAGGTTTTGAGAAAGATCCGGAGATAAAACCTAATCAAGGACTGAGATCCATTACCTCAGAACCAGCGCACGTAATGGATGAGAATGGTTACCCAATCCATCGACCTGGTATAGCTGGAGTCGAGACAGCCTATACACGCTTCATCACAGACATCCTTGGCGTTTCCGCTGACTCAAGACCTGATTCCAACGAAGCCAGACTCCACGAGTTCTTGAGCGACGGAACTATTGACGAGGAGTCGGCCAAGCAGATCCTAAAAGATACTCTAACCGGAGAAATGAGCCCATTATCTACTGATGTAAGGCGGATAACACCCAAGAAGCTTCGGGAGCTCTCTGATTTGCTTAGGACGCATTTGAGTAGTGCAGCAACTAAGCAATTGGATATGGGGGTGGTTTTGTCGGATCTTGATACCATGTTGGTGACACTCGACAAGGCCGAACGCGAGGGGGAGTAGACAAGGATCAGTTGAAGAGTTTTAACAGTTTGATTCTGAAGACTTACAGCGTGATTGAAGACTATGTCAAAGGCAGAGAAGGGGATACCAAGAGTTCCAGTGCGGAAGTATCCCCCTATCATCGCAGTAACTTTATGCTATCGATCGTCGAGCCTTCACTGCAGAGGATCCAAAAGCATCTGGACCAGACACACTCTTTTTCTGATATCGGTTCACTAGTGCGCGCACATAAGCACCTGGAAACGCTTTTAGAGGTCTTAGTCACCTTGTCACCGCAAGGGCAGCCCGTGTCCTCTGAAACCTACAGCTTCCTGAATCGATTAGCTGAGGCTAAGGTCACCTTGTCGCAGCAATTGGATACTCTCCAGCAGCAGCAGGAGAGTGCGAAAGCGCAACTATCTATTCTGATTAATCGTTCAGGTTCTTGGGCCGATGTTGCTCGTCAGTCCCTGCAGCGTTTTGACAGTACCCGGCCTGTAGTGAAATTCGGCACTGAGCAGTATACC
GCAATTCACCGTCAGATGATGGCGGCCCATGCAGCCATTACGCTACAGGAGGTATCGGAGTTTACTGATGATATGCGAAACTTTACAGCGGACTCTATTCCACTACTGATTCGACTTGGACGAAGCAGTTTAATAGATGAGCATTTGGTTGAACAGAGAGAGAAGTTGCGAGAGCTGACGACCATCGCCGAGCGACTGAACCGGTTGGAGCGGGAATGGATGTGA
    
    #grep "yopO" selected_gtf_files/Yersinia_enterocolitica_YE3.gtf
    NZ_CP016943.1   RefSeq  gene    12782   14970   .       -       .       gene_id "BED35_RS00550"; transcript_id ""; gbkey "Gene"; gene "yopO"; gene_biotype "pseudogene"; gene_synonym "ypkA"; locus_tag "BED35_RS00550"; old_locus_tag "BED35_00550"; pseudo "true"; 
    
    samtools faidx Yersinia_enterocolitica_YE3.fna NZ_CP016943.1:12782-14970 > temp.fna
    revseq -sequence temp.fna -outseq 12782-14970.rev
    sed -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' 12782-14970.rev > temp_.fna
    
    Yersinia_enterocolitica_YE3 ATGAAAATCATGGGAACTATGCCACCGTCGATCTCCCTCGCTAAAGCTCATGAGCGCATCAGCCAACATTGGCAAAATCCTGTCGGTGAGCTCAATATCGGAGGAAAACGGTATAGAATTATCGATAATCAAGTGCTGCGCTTGAACCCCCACAGTGGTTTTTCTCTCTTTCGAGAAGGGGTTGGTAAGATCTTTTCGGGGAAGATGTTTAACTTTTCAATTGCTCGTAACCTTACTGAGACACTCCATGCAGCCCAGAAAACGACTTCGCAGGAGCTAAGGTCTGATATCCCCAATGTTCTCAGTAATCTCTTTGGAGCCAAGCCACAGACCGAACTGCCGCTGGGTTGGAAAGGGAAGCCTTTGTCAGGAGCTCCGGATCTTGAAGGGATGCGAGTGGCTGAAACCGATAAGTTTGCCGAGGGCGAAAGCCATATTAGTATAATAGAAACTAAGGATAATCAGCGGTTGGTGGCTAAGATTGAACGCTCCATTGCCGAGGGGCATTTGTTCGCAGAACTGGAGGCTTATAAACACATCTATAAAACCGCGGGCAAACATCCTAATCTTGCCAATGTCCATGGCATGGCTGTGGTGCCATACGGTAACCGTAAGGAGGAAGCATTGCTGATGGATGAGGTGGATGGTTGGCGTTGTTCTGACACACTAAGAAGCCTCGCCGATAGCTGGAAGCAAGGAAAGATCAATAGTGAAGCCTACTGGGGAACGATCAAGTTTATTGCCCATCGGCTATTAGATGTAACCAATCACCTTGCCAAGGCAGGGATAGTACATAACGATATCAAACCCGGTAATGTGGTATTTGACCGCGCTAGCGGAGAGCCCGTTGTCATTGATCTAGGATTACACTCTCGTTCAGGGGAACAACCTAAGGGGTTTACAGAATCCTTCAAAGCGCCGGAGCTTGGAGTAGGAAACCTAGGCGCATCAGAAAAGAGCGATGTTTTTCTCGTAGTTTCAACCCTTCTACATGGTATCGAAGGTTTTGAGAAAGATCCGGAGATAAAACCTAATCAAGGACTGAGATCCATTACCTCAGAACCAGCGCACGTAATGGATGAGAATGGTTACCCAATCCATCGACCTGGTATAGCTGGAGTCGAGACAGCCTATACACGCTTCATCACAGACATCCTTGGCGTTTCCGCTGACTCAAGACCTGATTCCAACGAAGCCAGACTCCACGAGTTCTTGAGCGACGGAACTATTGACGAGGAGTCGGCCAAGCAGATCCTAAAAGATACTCTAACCGGAGAAATGAGCCCATTATCTACTGATGTAAGGCGGATAACACCCAAGAAGCTTCGGGAGCTCTCTGATTTGCTTAGGACGCATTTGAGTAGTGCAGCAACTAAGCAATTGGATATGGGGGTGGTTTTGTCGGATCTTGATACCATGTTGGTGACACTCGACAAGGCCGAACGCGAGGGGGAGTAGACAAGGATCAGTTGAAGAGTTTTAACAGTTTGATTCTGAAGACTTACAGCGTGATTGAAGACTATGTCAAAGGCAGAGAAGGGGATACCAAGAGTTCCAGTGCGGAAGTATCCCCCTATCATCGCAGTAACTTTATGCTATCGATCGTCGAGCCTTCACTGCAGAGGATCCAAAAGCATCTGGACCAGACACACTCTTTTTCTGATATCGGTTCACTAGTGCGCGCACATAAGCACCTGGAAACGCTTTTAGAGGTCTTAGTCACCTTGTCACCGCAAGGGCAGCCCGTGTCCTCTGAAACCTACAGCTTCCTGAATCGATTAGCTGAGGCTAAGGTCACCTTGTCGCAGCAATTGGATACTCTCCAGCAGCAGCAGGAGAGTGCGAAAGCGCAACTATCTATTCTGATTAATCGTTCAGGTTCTTGGGCCGATGTTGCTCGTCAGTCCCTGCAGCGTTTTGACAGTACCCGGCCTGTAGTGAAATTCGGCACTGAGCAGTATACCGCAA
TTCACCGTCAGATGATGGCGGCCCATGCAGCCATTACGCTACAGGAGGTATCGGAGTTTACTGATGATATGCGAAACTTTACAGCGGACTCTATTCCACTACTGATTCGACTTGGACGAAGCAGTTTAATAGATGAGCATTTGGTTGAACAGAGAGAGAAGTTGCGAGAGCTGACGACCATCGCCGAGCGACTGAACCGGTTGGAGCGGGAATGGATGTGA
    
    #grep "yopO" selected_gtf_files/Yersinia_enterocolitica_YE6.gtf
    NZ_CP016937.1   RefSeq  gene    4748707 4750895 .       -       .       gene_id "BED33_RS21960"; transcript_id ""; gbkey "Gene"; gene "yopO"; gene_biotype "pseudogene"; gene_synonym "ypkA"; locus_tag "BED33_RS21960"; old_locus_tag "BED33_21960"; pseudo "true"; 
    
    samtools faidx Yersinia_enterocolitica_YE6.fna NZ_CP016937.1:4748707-4750895 > temp.fna
    revseq -sequence temp.fna -outseq 4748707-4750895.rev
    sed -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' 4748707-4750895.rev > temp_.fna
    
    Yersinia_enterocolitica_YE6 ATGAAAATCATGGGAACTATGCCACCGTCGATCTCCCTCGCTAAAGCTCATGAGCGCATCAGCCAACATTGGCAAAATCCTGTCGGTGAGCTCAATATCGGAGGAAAACGGTATAGAATTATCGATAATCAAGTGCTGCGCTTGAACCCCCACAGTGGTTTTTCTCTCTTTCGAGAAGGGGTTGGTAAGATCTTTTCGGGGAAGATGTTTAACTTTTCAATTGCTCGTAACCTTACTGAGACACTCCATGCAGCCCAGAAAACGACTTCGCAGGAGCTAAGGTCTGATATCCCCAATGTTCTCAGTAATCTCTTTGGAGCCAAGCCACAGACCGAACTGCCGCTGGGTTGGAAAGGGAAGCCTTTGTCAGGAGCTCCGGATCTTGAAGGGATGCGAGTGGCTGAAACCGATAAGTTTGCCGAGGGCGAAAGCCATATTAGTATAATAGAAACTAAGGATAATCAGCGGTTGGTGGCTAAGATTGAACGCTCCATTGCCGAGGGGCATTTGTTCGCAGAACTGGAGGCTTATAAACACATCTATAAAACCGCGGGCAAACATCCTAATCTTGCCAATGTCCATGGCATGGCTGTGGTGCCATACGGTAACCGTAAGGAGGAAGCATTGCTGATGGATGAGGTGGATGGTTGGCGTTGTTCTGACACACTAAGAAGCCTCGCCGATAGCTGGAAGCAAGGAAAGATCAATAGTGAAGCCTACTGGGGAACGATCAAGTTTATTGCCCATCGGCTATTAGATGTAACCAATCACCTTGCCAAGGCAGGGATAGTACATAACGATATCAAACCCGGTAATGTGGTATTTGACCGCGCTAGCGGAGAGCCCGTTGTCATTGATCTAGGATTACACTCTCGTTCAGGGGAACAACCTAAGGGGTTTACAGAATCCTTCAAAGCGCCGGAGCTTGGAGTAGGAAACCTAGGCGCATCAGAAAAGAGCGATGTTTTTCTCGTAGTTTCAACCCTTCTACATGGTATCGAAGGTTTTGAGAAAGATCCGGAGATAAAACCTAATCAAGGACTGAGATCCATTACCTCAGAACCAGCGCACGTAATGGATGAGAATGGTTACCCAATCCATCGACCTGGTATAGCTGGAGTCGAGACAGCCTATACACGCTTCATCACAGACATCCTTGGCGTTTCCGCTGACTCAAGACCTGATTCCAACGAAGCCAGACTCCACGAGTTCTTGAGCGACGGAACTATTGACGAGGAGTCGGCCAAGCAGATCCTAAAAGATACTCTAACCGGAGAAATGAGCCCATTATCTACTGATGTAAGGCGGATAACACCCAAGAAGCTTCGGGAGCTCTCTGATTTGCTTAGGACGCATTTGAGTAGTGCAGCAACTAAGCAATTGGATATGGGGGTGGTTTTGTCGGATCTTGATACCATGTTGGTGACACTCGACAAGGCCGAACGCGAGGGGGAGTAGACAAGGATCAGTTGAAGAGTTTTAACAGTTTGATTCTGAAGACTTACAGCGTGATTGAAGACTATGTCAAAGGCAGAGAAGGGGATACCAAGAGTTCCAGTGCGGAAGTATCCCCCTATCATCGCAGTAACTTTATGCTATCGATCGTCGAGCCTTCACTGCAGAGGATCCAAAAGCATCTGGACCAGACACACTCTTTTTCTGATATCGGTTCACTAGTGCGCGCACATAAGCACCTGGAAACGCTTTTAGAGGTCTTAGTCACCTTGTCACCGCAAGGGCAGCCCGTGTCCTCTGAAACCTACAGCTTCCTGAATCGATTAGCTGAGGCTAAGGTCACCTTGTCGCAGCAATTGGATACTCTCCAGCAGCAGCAGGAGAGTGCGAAAGCGCAACTATCTATTCTGATTAATCGTTCAGGTTCTTGGGCCGATGTTGCTCGTCAGTCCCTGCAGCGTTTTGACAGTACCCGGCCTGTAGTGAAATTCGGCACTGAGCAGTATACCGCAA
TTCACCGTCAGATGATGGCGGCCCATGCAGCCATTACGCTACAGGAGGTATCGGAGTTTACTGATGATATGCGAAACTTTACAGCGGACTCTATTCCACTACTGATTCGACTTGGACGAAGCAGTTTAATAGATGAGCATTTGGTTGAACAGAGAGAGAAGTTGCGAGAGCTGACGACCATCGCCGAGCGACTGAACCGGTTGGAGCGGGAATGGATGTGA
    
    #grep "yopO" selected_gtf_files/Yersinia_pestis_790.gtf
    
    #grep "yopO" selected_gtf_files/Yersinia_pestis_FDAARGOS_601.gtf
    NZ_CP033697.1   RefSeq  gene    68815   70300   .       +       .       gene_id "EGX46_RS00005"; transcript_id ""; gbkey "Gene"; gene "yopO"; gene_biotype "protein_coding"; gene_synonym "ypkA"; locus_tag "EGX46_RS00005"; old_locus_tag "EGX46_00005"; part "1"; 
    NZ_CP033697.1   RefSeq  gene    1       713     .       +       .       gene_id "EGX46_RS00005"; transcript_id ""; gbkey "Gene"; gene "yopO"; gene_biotype "protein_coding"; gene_synonym "ypkA"; locus_tag "EGX46_RS00005"; old_locus_tag "EGX46_00005"; part "2";
    
    samtools faidx Yersinia_pestis_FDAARGOS_601.fna NZ_CP033697.1:68815-70300 > temp.fna
    samtools faidx Yersinia_pestis_FDAARGOS_601.fna NZ_CP033697.1:1-713 >> temp.fna
    #delete the second ">****"
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    Yersinia_pestis_FDAARGOS_601    ATGAAAAGCGTGAAAATCATGGGAACTATGCCACCGTCGATCTCCCTCGCCAAAGCTCATGAGCGCATCAGCCAACATTGGCAAAATCCTGTCGGTGAGCTCAATATCGGAGGAAAACGGTATAGAATTATCGATAATCAAGTGTTGCGCTTGAACCCCCACAGTGGTTTTTCTCTCTTTCGAGAAGGGGTTGGTAAGATCTTTTCGGGGAAGATGTTTAACTTTTCAATTGCTCGTAACCTTACTGACACACTCCATGCGGCCCAGAAAACGACTTCGCAGGAGCTAAGGTCTGATATCCCCAATGCTCTCAGTAATCTCTTTGGAGCCAAGCCACAGACCGAACTGCCGCTGGGTTGGAAAGGGGAGCCCTTGTCAGGAGCTCCGGATCTTGAAGGGATGCGAGTGGCTGAAACCGATAAGTTTGCCGAGGGCGAAAGCCATATTAGTATAATAGAAACTAAGGATAAGCAGCGGTTGGTAGCTAAGATTGAACGCTCCATTGCCGAGGGGCATTTGTTCGCAGAACTGGAGGCTTATAAACACATCTATAAAACCGCGGGCAAACATCCTAATCTTGCCAATGTTCATGGCATGGCTGTGGTGCCATACGGTAACCGTAAGGAGGAAGCATTGCTGATGGATGAGGTGGATGGTTGGCGTTGTTCTGACACACTAAGAACCCTCGCCGATAGCTGGAAGCAAGGAAAGATCAATAGTGAAGCCTACTGGGGAACGATCAAGTTTATTGCCCATCGGCTATTAGATGTAACCAATCACCTTGCCAAGGCAGGGGTAGTACATAACGATATCAAACCCGGTAATGTGGTATTTGACCGCGCTAGCGGAGAGCCCGTTGTTATTGATCTAGGATTACACTCTCGTTCAGGGGAACAACCTAAGGGGTTTACAGAATCCTTCAAAGCGCCGGAGCTTGGAGTAGGAAACCTAGGCGCATCAGAAAAGAGCGATGTTTTTCTCGTAGTGTCAACCCTTCTACATTGTATCGAAGGTTTTGAGAAAAATCCGGAGATAAAGCCTAATCAAGGACTGAGATTCATTACCTCAGAACCAGCGCACGTAATGGATGAGAATGGTTATCCAATCCATCGACCTGGTATAGCTGGAGTCGAGACAGCCTATACACGCTTCATCACAGACATCCTTGGCGTTTCCGCTGACTCAAGACCTGATTCCAACGAAGCCAGACTCCACGAGTTCTTGAGCGACGGAACTATCGACGAGGAGTCGGCCAAGCAGATCCTAAAAGATACCCTAACCGGAGAAATGAGCCCATTATCTACTGATGTAAGGCGGATAACACCCAAGAAGCTTCGGGAGCTATCTGATTTGCTTAGGACGCATTTGAGCAGTGCAGCAACTAAGCAATTGGATATGGGGGGGGTTTTGTCGGATCTTGATACCATGTTGGTGGCACTCGACAAGGCCGAACGCGAGGGGGGAGTAGACAAGGATCAGTTGAAGAGTTTTAACAGTTTGATTCTGAAGACTTACAGAGTGATTGAAGACTATGTCAAAGGCAGAGAAGGGGATACCAAGAATTCCAGTACGGAAGTATCCCCCTATCATCGCAGTAACTTTATGCTATCGATCGTCGAACCTTCACTGCAGAGGATCCAGAAGCATCTGGACCAGACACACTCTTTTTCTGATATCGGTTCACTAGTGCGCGCACATAAGCACCTGGAAACGCTTTTAGAGGTCTTAGTCACCTTGTCACAGCAAGGGCAGCCCGTGTCCTCTGAAACCTACGGCTTCCTGAATCGATTAACTGAGGCTAAGATCACCTTGTCGCAGCAATTGAATACTCTCCAGCAGCAGCAGGAGAGTGCGAAAGCGCAATTATCTATTCTGATTAATCGTTCAGGTTCTTGGGCCGATGTTGCTCGTCAGTCCCTGCAGCGTTTTGACAGTACCCGGCCTGTAGTGAAATTCGGCACTGA
GCAGTATACCGCAATTCACCGTCAGATGATGGCGGCCCATGCAGCTATTACGCTACAGGAGGTATCGGAGTTTACTGATGATATGCGAAACTTTACAGTGGACTCTATTCCACTACTGATTCAACTTGGACGAAGCAGTTTAATGGATGAGCATTTGGTTGAACAGAGAGAAAAGTTGCGAGAGCTGACGACCATCGCCGAGCGACTGAACCGGTTGGAGCGGGAATGGATGTGA
    
    #grep "yopO" selected_gtf_files/Yersinia_pestis_Harbin_35.gtf
    NC_017263.1     RefSeq  gene    49729   51926   .       -       .       gene_id "YPC_RS21300"; transcript_id ""; gbkey "Gene"; gene "yopO"; gene_biotype "pseudogene"; gene_synonym "ypkA"; locus_tag "YPC_RS21300"; pseudo "true"; 
    
    samtools faidx Yersinia_pestis_Harbin_35.fna NC_017263.1:49729-51926 > temp.fna
    revseq -sequence temp.fna -outseq 49729-51926.rev
    sed -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' 49729-51926.rev > temp_.fna
    
    Yersinia_pestis_Harbin_35   ATGAAAAGCGTGAAAATCATGGGAACTATGCCACCGTCGATCTCCCTCGCCAAAGCTCATGAGCGCATCAGCCAACATTGGCAAAATCCTGTCGGTGAGCTCAATATCGGAGGAAAACGGTATAGAATTATCGATAATCAAGTGTTGCGCTTGAACCCCCACAGTGGTTTTTCTCTCTTTCGAGAAGGGGTTGGTAAGATCTTTTCGGGGAAGATGTTTAACTTTTCAATTGCTCGTAACCTTACTGACACACTCCATGCGGCCCAGAAAACGACTTCGCAGGAGCTAAGGTCTGATATCCCCAATGCTCTCAGTAATCTCTTTGGAGCCAAGCCACAGACCGAACTGCCGCTGGGTTGGAAAGGGGAGCCCTTGTCAGGAGCTCCGGATCTTGAAGGGATGCGAGTGGCTGAAACCGATAAGTTTGCCGAGGGCGAAAGCCATATTAGTATAATAGAAACTAAGGATAAGCAGCGGTTGGTAGCTAAGATTGAACGCTCCATTGCCGAGGGGCATTTGTTCGCAGAACTGGAGGCTTATAAACACATCTATAAAACCGCGGGCAAACATCCTAATCTTGCCAATGTTCATGGCATGGCTGTGGTGCCATACGGTAACCGTAAGGAGGAAGCATTGCTGATGGATGAGGTGGATGGTTGGCGTTGTTCTGACACACTAAGAACCCTCGCCGATAGCTGGAAGCAAGGAAAGATCAATAGTGAAGCCTACTGGGGAACGATCAAGTTTATTGCCCATCGGCTATTAGATGTAACCAATCACCTTGCCAAGGCAGGGGTAGTACATAACGATATCAAACCCGGTAATGTGGTATTTGACCGCGCTAGCGGAGAGCCCGTTGTTATTGATCTAGGATTACACTCTCGTTCAGGGGAACAACCTAAGGGGTTTACAGAATCCTTCAAAGCGCCGGAGCTTGGAGTAGGAAACCTAGGCGCATCAGAAAAGAGCGATGTTTTTCTCGTAGTGTCAACCCTTCTACATTGTATCGAAGGTTTTGAGAAAAATCCGGAGATAAAGCCTAATCAAGGACTGAGATTCATTACCTCAGAACCAGCGCACGTAATGGATGAGAATGGTTATCCAATCCATCGACCTGGTATAGCTGGAGTCGAGACAGCCTATACACGCTTCATCACAGACATCCTTGGCGTTTCCGCTGACTCAAGACCTGATTCCAACGAAGCCAGACTCCACGAGTTCTTGAGCGACGGAACTATCGACGAGGAGTCGGCCAAGCAGATCCTAAAAGATACCCTAACCGGAGAAATGAGCCCATTATCTACTGATGTAAGGCGGATAACACCCAAGAAGCTTCGGGAGCTATCTGATTTGCTTAGGACGCATTTGAGCAGTGCAGCAACTAAGCAATTGGATATGGGGGGGTTTTGTCGGATCTTGATACCATGTTGGTGGCACTCGACAAGGCCGAACGCGAGGGGGGAGTAGACAAGGATCAGTTGAAGAGTTTTAACAGTTTGATTCTGAAGACTTACAGAGTGATTGAAGACTATGTCAAAGGCAGAGAAGGGGATACCAAGAATTCCAGTACGGAAGTATCCCCCTATCATCGCAGTAACTTTATGCTATCGATCGTCGAACCTTCACTGCAGAGGATCCAGAAGCATCTGGACCAGACACACTCTTTTTCTGATATCGGTTCACTAGTGCGCGCACATAAGCACCTGGAAACGCTTTTAGAGGTCTTAGTCACCTTGTCACAGCAAGGGCAGCCCGTGTCCTCTGAAACCTACGGCTTCCTGAATCGATTAACTGAGGCTAAGATCACCTTGTCGCAGCAATTGAATACTCTCCAGCAGCAGCAGGAGAGTGCGAAAGCGCAATTATCTATTCTGATTAATCGTTCAGGTTCTTGGGCCGATGTTGCTCGTCAGTCCCTGCAGCGTTTTGACAGTACCCGGCCTGTAGTGAAATTCGGCACTGAGCAGT
ATACCGCAATTCACCGTCAGATGATGGCGGCCCATGCAGCTATTACGCTACAGGAGGTATCGGAGTTTACTGATGATATGCGAAACTTTACAGTGGACTCTATTCCACTACTGATTCAACTTGGACGAAGCAGTTTAATGGATGAGCATTTGGTTGAACAGAGAGAAAAGTTGCGAGAGCTGACGACCATCGCCGAGCGACTGAACCGGTTGGAGCGGGAATGGATGTGA
    
    #grep "yopO" selected_gtf_files/Yersinia_pestis_Harbin_35_bis.gtf
    NZ_CP009703.1   RefSeq  gene    55189   57386   .       +       .       gene_id "CH55_RS00985"; transcript_id ""; gbkey "Gene"; gene "yopO"; gene_biotype "pseudogene"; gene_synonym "ypkA"; locus_tag "CH55_RS00985"; old_locus_tag "CH55_4357"; pseudo "true"; 
    
    samtools faidx Yersinia_pestis_Harbin_35_bis.fna NZ_CP009703.1:55189-57386 > temp.fna
    sed -i -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' temp.fna
    
    Yersinia_pestis_Harbin_35_bis   ATGAAAAGCGTGAAAATCATGGGAACTATGCCACCGTCGATCTCCCTCGCCAAAGCTCATGAGCGCATCAGCCAACATTGGCAAAATCCTGTCGGTGAGCTCAATATCGGAGGAAAACGGTATAGAATTATCGATAATCAAGTGTTGCGCTTGAACCCCCACAGTGGTTTTTCTCTCTTTCGAGAAGGGGTTGGTAAGATCTTTTCGGGGAAGATGTTTAACTTTTCAATTGCTCGTAACCTTACTGACACACTCCATGCGGCCCAGAAAACGACTTCGCAGGAGCTAAGGTCTGATATCCCCAATGCTCTCAGTAATCTCTTTGGAGCCAAGCCACAGACCGAACTGCCGCTGGGTTGGAAAGGGGAGCCCTTGTCAGGAGCTCCGGATCTTGAAGGGATGCGAGTGGCTGAAACCGATAAGTTTGCCGAGGGCGAAAGCCATATTAGTATAATAGAAACTAAGGATAAGCAGCGGTTGGTAGCTAAGATTGAACGCTCCATTGCCGAGGGGCATTTGTTCGCAGAACTGGAGGCTTATAAACACATCTATAAAACCGCGGGCAAACATCCTAATCTTGCCAATGTTCATGGCATGGCTGTGGTGCCATACGGTAACCGTAAGGAGGAAGCATTGCTGATGGATGAGGTGGATGGTTGGCGTTGTTCTGACACACTAAGAACCCTCGCCGATAGCTGGAAGCAAGGAAAGATCAATAGTGAAGCCTACTGGGGAACGATCAAGTTTATTGCCCATCGGCTATTAGATGTAACCAATCACCTTGCCAAGGCAGGGGTAGTACATAACGATATCAAACCCGGTAATGTGGTATTTGACCGCGCTAGCGGAGAGCCCGTTGTTATTGATCTAGGATTACACTCTCGTTCAGGGGAACAACCTAAGGGGTTTACAGAATCCTTCAAAGCGCCGGAGCTTGGAGTAGGAAACCTAGGCGCATCAGAAAAGAGCGATGTTTTTCTCGTAGTGTCAACCCTTCTACATTGTATCGAAGGTTTTGAGAAAAATCCGGAGATAAAGCCTAATCAAGGACTGAGATTCATTACCTCAGAACCAGCGCACGTAATGGATGAGAATGGTTATCCAATCCATCGACCTGGTATAGCTGGAGTCGAGACAGCCTATACACGCTTCATCACAGACATCCTTGGCGTTTCCGCTGACTCAAGACCTGATTCCAACGAAGCCAGACTCCACGAGTTCTTGAGCGACGGAACTATCGACGAGGAGTCGGCCAAGCAGATCCTAAAAGATACCCTAACCGGAGAAATGAGCCCATTATCTACTGATGTAAGGCGGATAACACCCAAGAAGCTTCGGGAGCTATCTGATTTGCTTAGGACGCATTTGAGCAGTGCAGCAACTAAGCAATTGGATATGGGGGGGTTTTGTCGGATCTTGATACCATGTTGGTGGCACTCGACAAGGCCGAACGCGAGGGGGGAGTAGACAAGGATCAGTTGAAGAGTTTTAACAGTTTGATTCTGAAGACTTACAGAGTGATTGAAGACTATGTCAAAGGCAGAGAAGGGGATACCAAGAATTCCAGTACGGAAGTATCCCCCTATCATCGCAGTAACTTTATGCTATCGATCGTCGAACCTTCACTGCAGAGGATCCAGAAGCATCTGGACCAGACACACTCTTTTTCTGATATCGGTTCACTAGTGCGCGCACATAAGCACCTGGAAACGCTTTTAGAGGTCTTAGTCACCTTGTCACAGCAAGGGCAGCCCGTGTCCTCTGAAACCTACGGCTTCCTGAATCGATTAACTGAGGCTAAGATCACCTTGTCGCAGCAATTGAATACTCTCCAGCAGCAGCAGGAGAGTGCGAAAGCGCAATTATCTATTCTGATTAATCGTTCAGGTTCTTGGGCCGATGTTGCTCGTCAGTCCCTGCAGCGTTTTGACAGTACCCGGCCTGTAGTGAAATTCGGCACTGAG
CAGTATACCGCAATTCACCGTCAGATGATGGCGGCCCATGCAGCTATTACGCTACAGGAGGTATCGGAGTTTACTGATGATATGCGAAACTTTACAGTGGACTCTATTCCACTACTGATTCAACTTGGACGAAGCAGTTTAATGGATGAGCATTTGGTTGAACAGAGAGAAAAGTTGCGAGAGCTGACGACCATCGCCGAGCGACTGAACCGGTTGGAGCGGGAATGGATGTGA
    
    #grep "yopO" selected_gtf_files/Yersinia_pestis_Java9.gtf
    NZ_CP009995.1   RefSeq  gene    76131   77073   .       -       .       gene_id "CH62_RS22640"; transcript_id ""; gbkey "Gene"; gene "yopO"; gene_biotype "protein_coding"; gene_synonym "ypkA"; locus_tag "CH62_RS22640"; part "2";
    NZ_CP009995.1   RefSeq  gene    1       1256    .       -       .       gene_id "CH62_RS22640"; transcript_id ""; gbkey "Gene"; gene "yopO"; gene_biotype "protein_coding"; gene_synonym "ypkA"; locus_tag "CH62_RS22640"; part "1"; 
    
    samtools faidx Yersinia_pestis_Java9.fna NZ_CP009995.1:76131-77073 > temp.fna
    samtools faidx Yersinia_pestis_Java9.fna NZ_CP009995.1:1-1256 >> temp.fna
    #delete the second ">****"
    revseq -sequence temp.fna -outseq 76131-77073.rev
    sed -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' 76131-77073.rev > temp_.fna
    
    Yersinia_pestis_Java9   ATGAAAAGCGTGAAAATCATGGGAACTATGCCACCGTCGATCTCCCTCGCCAAAGCTCATGAGCGCATCAGCCAACATTGGCAAAATCCTGTCGGTGAGCTCAATATCGGAGGAAAACGGTATAGAATTATCGATAATCAAGTGTTGCGCTTGAACCCCCACAGTGGTTTTTCTCTCTTTCGAGAAGGGGTTGGTAAGATCTTTTCGGGGAAGATGTTTAACTTTTCAATTGCTCGTAACCTTACTGACACACTCCATGCGGCCCAGAAAACGACTTCGCAGGAGCTAAGGTCTGATATCCCCAATGCTCTCAGTAATCTCTTTGGAGCCAAGCCACAGACCGAACTGCCGCTGGGTTGGAAAGGGGAGCCCTTGTCAGGAGCTCCGGATCTTGAAGGGATGCGAGTGGCTGAAACCGATAAGTTTGCCGAGGGCGAAAGCCATATTAGTATAATAGAAACTAAGGATAAGCAGCGGTTGGTAGCTAAGATTGAACGCTCCATTGCCGAGGGGCATTTGTTCGCAGAACTGGAGGCTTATAAACACATCTATAAAACCGCGGGCAAACATCCTAATCTTGCCAATGTTCATGGCATGGCTGTGGTGCCATACGGTAACCGTAAGGAGGAAGCATTGCTGATGGATGAGGTGGATGGTTGGCGTTGTTCTGACACACTAAGAACCCTCGCCGATAGCTGGAAGCAAGGAAAGATCAATAGTGAAGCCTACTGGGGAACGATCAAGTTTATTGCCCATCGGCTATTAGATGTAACCAATCACCTTGCCAAGGCAGGGGTAGTACATAACGATATCAAACCCGGTAATGTGGTATTTGACCGCGCTAGCGGAGAGCCCGTTGTTATTGATCTAGGATTACACTCTCGTTCAGGGGAACAACCTAAGGGGTTTACAGAATCCTTCAAAGCGCCGGAGCTTGGAGTAGGAAACCTAGGCGCATCAGAAAAGAGCGATGTTTTTCTCGTAGTGTCAACCCTTCTACATTGTATCGAAGGTTTTGAGAAAAATCCGGAGATAAAGCCTAATCAAGGACTGAGATTCATTACCTCAGAACCAGCGCACGTAATGGATGAGAATGGTTATCCAATCCATCGACCTGGTATAGCTGGAGTCGAGACAGCCTATACACGCTTCATCACAGACATCCTTGGCGTTTCCGCTGACTCAAGACCTGATTCCAACGAAGCCAGACTCCACGAGTTCTTGAGCGACGGAACTATCGACGAGGAGTCGGCCAAGCAGATCCTAAAAGATACCCTAACCGGAGAAATGAGCCCATTATCTACTGATGTAAGGCGGATAACACCCAAGAAGCTTCGGGAGCTATCTGATTTGCTTAGGACGCATTTGAGCAGTGCAGCAACTAAGCAATTGGATATGGGGGGGGTTTTGTCGGATCTTGATACCATGTTGGTGGCACTCGACAAGGCCGAACGCGAGGGGGGAGTAGACAAGGATCAGTTGAAGAGTTTTAACAGTTTGATTCTGAAGACTTACAGAGTGATTGAAGACTATGTCAAAGGCAGAGAAGGGGATACCAAGAATTCCAGTACGGAAGTATCCCCCTATCATCGCAGTAACTTTATGCTATCGATCGTCGAACCTTCACTGCAGAGGATCCAGAAGCATCTGGACCAGACACACTCTTTTTCTGATATCGGTTCACTAGTGCGCGCACATAAGCACCTGGAAACGCTTTTAGAGGTCTTAGTCACCTTGTCACAGCAAGGGCAGCCCGTGTCCTCTGAAACCTACGGCTTCCTGAATCGATTAACTGAGGCTAAGATCACCTTGTCGCAGCAATTGAATACTCTCCAGCAGCAGCAGGAGAGTGCGAAAGCGCAATTATCTATTCTGATTAATCGTTCAGGTTCTTGGGCCGATGTTGCTCGTCAGTCCCTGCAGCGTTTTGACAGTACCCGGCCTGTAGTGAAATTCGGCACTGAGCAGTATA
CCGCAATTCACCGTCAGATGATGGCGGCCCATGCAGCTATTACGCTACAGGAGGTATCGGAGTTTACTGATGATATGCGAAACTTTACAGTGGACTCTATTCCACTACTGATTCAACTTGGACGAAGCAGTTTAATGGATGAGCATTTGGTTGAACAGAGAGAAAAGTTGCGAGAGCTGACGACCATCGCCGAGCGACTGAACCGGTTGGAGCGGGAATGGATGTGA
    
    #grep "yopO" selected_gtf_files/Yersinia_pestis_Nicholisk_41.gtf
    NZ_CP009990.1   RefSeq  gene    47448   49645   .       -       .       gene_id "CH63_RS00925"; transcript_id ""; gbkey "Gene"; gene "yopO"; gene_biotype "pseudogene"; gene_synonym "ypkA"; locus_tag "CH63_RS00925"; old_locus_tag "CH63_4306"; pseudo "true"; 
    
    samtools faidx Yersinia_pestis_Nicholisk_41.fna NZ_CP009990.1:47448-49645 > temp.fna
    revseq -sequence temp.fna -outseq 47448-49645.rev
    sed -e ':a;N;$!ba;s/\n//g' -e 's/:/\t/g' 47448-49645.rev > temp_.fna
    
    Yersinia_pestis_Nicholisk_41    ATGAAAAGCGTGAAAATCATGGGAACTATGCCACCGTCGATCTCCCTCGCCAAAGCTCATGAGCGCATCAGCCAACATTGGCAAAATCCTGTCGGTGAGCTCAATATCGGAGGAAAACGGTATAGAATTATCGATAATCAAGTGTTGCGCTTGAACCCCCACAGTGGTTTTTCTCTCTTTCGAGAAGGGGTTGGTAAGATCTTTTCGGGGAAGATGTTTAACTTTTCAATTGCTCGTAACCTTACTGACACACTCCATGCGGCCCAGAAAACGACTTCGCAGGAGCTAAGGTCTGATATCCCCAATGCTCTCAGTAATCTCTTTGGAGCCAAGCCACAGACCGAACTGCCGCTGGGTTGGAAAGGGGAGCCCTTGTCAGGAGCTCCGGATCTTGAAGGGATGCGAGTGGCTGAAACCGATAAGTTTGCCGAGGGCGAAAGCCATATTAGTATAATAGAAACTAAGGATAAGCAGCGGTTGGTAGCTAAGATTGAACGCTCCATTGCCGAGGGGCATTTGTTCGCAGAACTGGAGGCTTATAAACACATCTATAAAACCGCGGGCAAACATCCTAATCTTGCCAATGTTCATGGCATGGCTGTGGTGCCATACGGTAACCGTAAGGAGGAAGCATTGCTGATGGATGAGGTGGATGGTTGGCGTTGTTCTGACACACTAAGAACCCTCGCCGATAGCTGGAAGCAAGGAAAGATCAATAGTGAAGCCTACTGGGGAACGATCAAGTTTATTGCCCATCGGCTATTAGATGTAACCAATCACCTTGCCAAGGCAGGGGTAGTACATAACGATATCAAACCCGGTAATGTGGTATTTGACCGCGCTAGCGGAGAGCCCGTTGTTATTGATCTAGGATTACACTCTCGTTCAGGGGAACAACCTAAGGGGTTTACAGAATCCTTCAAAGCGCCGGAGCTTGGAGTAGGAAACCTAGGCGCATCAGAAAAGAGCGATGTTTTTCTCGTAGTGTCAACCCTTCTACATTGTATCGAAGGTTTTGAGAAAAATCCGGAGATAAAGCCTAATCAAGGACTGAGATTCATTACCTCAGAACCAGCGCACGTAATGGATGAGAATGGTTATCCAATCCATCGACCTGGTATAGCTGGAGTCGAGACAGCCTATACACGCTTCATCACAGACATCCTTGGCGTTTCCGCTGACTCAAGACCTGATTCCAACGAAGCCAGACTCCACGAGTTCTTGAGCGACGGAACTATCGACGAGGAGTCGGCCAAGCAGATCCTAAAAGATACCCTAACCGGAGAAATGAGCCCATTATCTACTGATGTAAGGCGGATAACACCCAAGAAGCTTCGGGAGCTATCTGATTTGCTTAGGACGCATTTGAGCAGTGCAGCAACTAAGCAATTGGATATGGGGGGGTTTTGTCGGATCTTGATACCATGTTGGTGGCACTCGACAAGGCCGAACGCGAGGGGGGAGTAGACAAGGATCAGTTGAAGAGTTTTAACAGTTTGATTCTGAAGACTTACAGAGTGATTGAAGACTATGTCAAAGGCAGAGAAGGGGATACCAAGAATTCCAGTACGGAAGTATCCCCCTATCATCGCAGTAACTTTATGCTATCGATCGTCGAACCTTCACTGCAGAGGATCCAGAAGCATCTGGACCAGACACACTCTTTTTCTGATATCGGTTCACTAGTGCGCGCACATAAGCACCTGGAAACGCTTTTAGAGGTCTTAGTCACCTTGTCACAGCAAGGGCAGCCCGTGTCCTCTGAAACCTACGGCTTCCTGAATCGATTAACTGAGGCTAAGATCACCTTGTCGCAGCAATTGAATACTCTCCAGCAGCAGCAGGAGAGTGCGAAAGCGCAATTATCTATTCTGATTAATCGTTCAGGTTCTTGGGCCGATGTTGCTCGTCAGTCCCTGCAGCGTTTTGACAGTACCCGGCCTGTAGTGAAATTCGGCACTGAG
CAGTATACCGCAATTCACCGTCAGATGATGGCGGCCCATGCAGCTATTACGCTACAGGAGGTATCGGAGTTTACTGATGATATGCGAAACTTTACAGTGGACTCTATTCCACTACTGATTCAACTTGGACGAAGCAGTTTAATGGATGAGCATTTGGTTGAACAGAGAGAAAAGTTGCGAGAGCTGACGACCATCGCCGAGCGACTGAACCGGTTGGAGCGGGAATGGATGTGA
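The minus-strand entries above are reverse-complemented with `revseq`, and the genes annotated in two parts (FDAARGOS_601, Java9) are joined by appending the second `samtools faidx` region and deleting the second header. As a minimal sketch of those two operations in Python, assuming the input is plain FASTA text with only A/C/G/T/N characters (the function names `reverse_complement` and `join_fasta_records` are illustrative and not part of the original workflow):

```python
# Illustrative helpers; the names are hypothetical, not taken from the original scripts.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_fasta_records(fasta_text: str) -> str:
    """Concatenate the sequence lines of all records, dropping every '>' header."""
    return "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    )


if __name__ == "__main__":
    # Toy two-part input standing in for the output of two `samtools faidx` calls.
    toy = ">contig:5-8\nTTGA\n>contig:1-4\nTCAT\n"
    joined = join_fasta_records(toy)
    print(joined)
    print(reverse_complement(joined))
```

The records are joined in the order they appear, so for a split gene the regions still have to be extracted in the right order before reverse-complementing, as in the commands above.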
  6. manually correct single-nucleotide errors in the sequences according to ${yop}_seq_additional.aln, then add the corrected sequences to ${yop}_seq.txt (time-consuming)

    for yop in yopJ yopB yopT yopE yopD yopM yopK yopO yopH; do
        grep "Yersinia_enterocolitica_WA" ${yop}_seq.txt > ${yop}_seq_additional.fasta
    done
    
    for yop in yopJ yopB yopT yopE yopD yopM yopK yopO yopH; do
        mafft --adjustdirection --clustalout ${yop}_seq_additional.fasta > ${yop}_seq_additional.aln
    done
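The manual correction in this step amounts to finding the columns where a newly added sequence disagrees with a reference row of the alignment. A small sketch of that comparison, assuming the two rows have already been read from the alignment as equal-length strings with `-` for gaps (the function name `alignment_differences` is hypothetical, and reading the Clustal file itself is left out):

```python
def alignment_differences(ref: str, query: str):
    """Yield (1-based column, ref_char, query_char) wherever two aligned rows differ."""
    if len(ref) != len(query):
        raise ValueError("aligned sequences must have the same length")
    for col, (r, q) in enumerate(zip(ref.upper(), query.upper()), start=1):
        if r != q:
            yield col, r, q


if __name__ == "__main__":
    # Toy aligned rows; real rows would come from ${yop}_seq_additional.aln.
    ref = "ATGAAA-TCATG"
    qry = "ATGAAAGTCTTG"
    for col, r, q in alignment_differences(ref, qry):
        print(f"column {col}: {r} -> {q}")
```

Columns are reported in alignment coordinates, so gaps upstream of a mismatch shift the position relative to the unaligned sequence.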
  7. convert ${yop}_seq.txt --> ${yop}_protein.fasta --> ${yop}_aligned_protein.fasta (translate, then align)

    cd data/yop_files
    for yop in yopJ yopB yopT yopE yopD yopM yopK yopO yopH; do
        python3 txt_to_protein.py ${yop}_seq.txt ${yop}_protein.fasta
    done
    for yop in yopJ yopB yopT yopE yopD yopM yopK yopO yopH; do
        #NOTE: the alignment sometimes does not work well because the manually added sequences are missing bases!
        python3 protein_alignment.py ${yop}_protein.fasta ${yop}_aligned_protein.fasta mafft
        #awk -F '_' '/^>/ { printf(">%s", $3); for (i = 4; i <= NF; ++i) printf("_%s", $i); printf("\n"); next } { print }' ${yop}_aligned_protein.fasta > ${yop}_aligned_protein_.fasta
    done
    
    conda install mamba -c conda-forge  #-n base
    mamba env create -f environment.yml
    
    grep ">" yopB_seq.txt | wc -l
    67 --> 73
    grep ">" yopJ_seq.txt | wc -l  #*
    67 --> 72
    grep ">" yopT_seq.txt | wc -l
    64 --> 73
    grep ">" yopE_seq.txt | wc -l
    70 --> 73
    grep ">" yopD_seq.txt | wc -l
    71 --> 73
    grep ">" yopM_seq.txt | wc -l
    70 --> 71 --> 73
    grep ">"  yopK_seq.txt  | wc -l
    73
    grep ">" yopO_seq.txt | wc -l  #*
    64 --> 72
    grep ">" yopH_seq.txt | wc -l
    73
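The script `txt_to_protein.py` is not shown in this post. As a rough sketch of what such a conversion might look like, the code below assumes each input line holds a name and a nucleotide sequence separated by whitespace (with an optional leading `>` on the name) and translates with the standard codon table. The input layout and all function names here are assumptions for illustration, not the original script.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(nt_seq: str) -> str:
    """Translate complete codons; '*' marks a stop, 'X' an unrecognised codon."""
    seq = nt_seq.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def parse_name_seq_lines(text: str):
    """Parse 'name<whitespace>sequence' lines; a leading '>' on the name is dropped."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            records.append((parts[0].lstrip(">"), parts[1]))
    return records


def to_protein_fasta(text: str) -> str:
    """Return FASTA text with one translated record per parsed input line."""
    return "".join(
        f">{name}\n{translate(seq)}\n" for name, seq in parse_name_seq_lines(text)
    )


if __name__ == "__main__":
    print(to_protein_fasta("strainA\tATGAAAATCATGTGA\n"), end="")
```

Because missing bases shift the reading frame, a sequence with a dropped nucleotide translates into a very different protein downstream of the error, which is consistent with the alignment problems noted in the loop above.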
  8. cluster all sequences in each ${yop}_aligned_protein.fasta (e.g. yopM_aligned_protein.fasta) so that all 100% identical sequences fall into one group. For each cluster, output one record as the representative, and produce a table listing all members of each group.

    for yop in yopJ yopB yopT yopE yopD yopM yopK yopO yopH; do
      usearch -cluster_fast ${yop}_aligned_protein.fasta -id 1.0 -centroids ${yop}_clustered.fasta -uc ${yop}_clusters.uc;
    done
    for yop in yopJ yopB yopT yopE yopD yopM yopK yopO yopH; do
      #parse the output of usearch to give a list of members for each cluster.
      python3 ~/Scripts/yop_analysis/parse_uc_file.py ${yop}_clusters.uc > ${yop}_clusters.txt
      sed -i "s/Members: \['//g" ${yop}_clusters.txt
      sed -i "s/'\]//g" ${yop}_clusters.txt
      sed -i "s/', '/, /g" ${yop}_clusters.txt
      sed -i "s/, /,/g" ${yop}_clusters.txt
      cut -d',' -f2- ${yop}_clusters.txt | sort > ${yop}_clusters_.txt
    done
    
    ~/Tools/csv2xls-0.4/csv_to_xls.py yopJ_clusters_.txt yopB_clusters_.txt yopT_clusters_.txt yopE_clusters_.txt yopD_clusters_.txt yopM_clusters_.txt yopK_clusters_.txt yopO_clusters_.txt yopH_clusters_.txt -o yop_clusters.xls
    
    for yop in yopJ yopB yopT yopE yopD yopM yopK yopO yopH; do
      python3 protein_alignment.py ${yop}_clustered.fasta ${yop}_clustered_aligned_protein.fasta mafft
    done
    for yop in yopJ yopB yopT yopE yopD yopM yopK yopO yopH; do
      python3 sort_fasta2.py ${yop}_clustered_aligned_protein.fasta ${yop}_sorted_selected_aligned_protein.fasta
    done
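The script `parse_uc_file.py` is not reproduced here. As a sketch of how a `.uc` file can be turned into per-cluster member lists, the code below assumes the tab-separated UC layout in which column 1 is the record type (`S` for a centroid, `H` for a hit, `C` for a cluster summary), column 2 is the cluster number, and column 9 is the query label. The function name `parse_uc` and the toy labels are illustrative only.

```python
from collections import defaultdict


def parse_uc(uc_text: str) -> dict:
    """Map cluster number -> list of labels, with the centroid listed first."""
    clusters = defaultdict(list)
    for line in uc_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        record_type, cluster_id, label = fields[0], int(fields[1]), fields[8]
        if record_type == "S":      # centroid (seed) of a new cluster
            clusters[cluster_id].insert(0, label)
        elif record_type == "H":    # member that matched an existing centroid
            clusters[cluster_id].append(label)
        # "C" records only summarise cluster sizes and are skipped here
    return dict(clusters)


if __name__ == "__main__":
    toy = "\n".join([
        "S\t0\t732\t*\t*\t*\t*\t*\tstrainA\t*",
        "H\t0\t732\t100.0\t+\t0\t0\t=\tstrainB\tstrainA",
        "S\t1\t732\t*\t*\t*\t*\t*\tstrainC\t*",
        "C\t0\t2\t*\t*\t*\t*\t*\tstrainA\t*",
        "C\t1\t1\t*\t*\t*\t*\t*\tstrainC\t*",
    ])
    for cluster_id, members in sorted(parse_uc(toy).items()):
        print(cluster_id, ",".join(members))
```

Printing the members comma-separated gives one line per cluster, similar in spirit to the `${yop}_clusters_.txt` files produced by the `sed` and `cut` calls above.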
  9. draw alignments

    library(ggmsa)
    library(ggplot2)
    library(ggtree)
    #library(gggenes)
    library(ape)
    library(Biostrings)
    library(ggnewscale)
    library(dplyr)
    library(ggtreeExtra)
    library(phangorn)
    library(RColorBrewer)
    library(patchwork)
    library(ggplotify)
    library(aplot)
    library(magick)
    library(treeio)
    
    #219 --> 5
    data <- "yopE_sorted_selected_aligned_protein.fasta"
    tidymsa <- tidy_msa(data)
    png("alignment_yopE.png", width=1100, height=800*1.2)
    msa_plot <- ggplot() +
    geom_msa(data = tidymsa, char_width = 0.5, seq_name = TRUE, show.legend = TRUE) + theme_msa() + facet_msa(50)
    msa_plot
    dev.off()
    
    #288 --> 6
    data <- "yopJ_sorted_selected_aligned_protein.fasta"
    tidymsa <- tidy_msa(data)
    png("alignment_yopJ.png", width=1100, height=192*6)
    msa_plot <- ggplot() +
    geom_msa(data = tidymsa, char_width = 0.5, seq_name = TRUE, show.legend = TRUE) + theme_msa() + facet_msa(50)
    msa_plot
    dev.off()
    
    #306 --> 7
    data <- "yopD_sorted_selected_aligned_protein.fasta"
    tidymsa <- tidy_msa(data)
    png("alignment_yopD.png", width=1100, height=192*6)
    msa_plot <- ggplot() +
    geom_msa(data = tidymsa, char_width = 0.5, seq_name = TRUE, show.legend = TRUE) + theme_msa() + facet_msa(50)
    msa_plot
    dev.off()
    
    #529 --> 11
    data <- "yopM_sorted_selected_aligned_protein.fasta"
    tidymsa <- tidy_msa(data)
    png("alignment_yopM.png", width=1100, height=192*12)
    msa_plot <- ggplot() +
    geom_msa(data = tidymsa, char_width = 0.5, seq_name = TRUE, show.legend = TRUE) + theme_msa() + facet_msa(50)
    msa_plot
    dev.off()
    
    #182 --> 4
    data <- "yopK_sorted_selected_aligned_protein.fasta"
    tidymsa <- tidy_msa(data)
    png("alignment_yopK.png", width=1100, height=192*4)
    msa_plot <- ggplot() +
    geom_msa(data = tidymsa, char_width = 0.5, seq_name = TRUE, show.legend = TRUE) + theme_msa() + facet_msa(50)
    msa_plot
    dev.off()
    
    #732 --> 15
    data <- "yopO_sorted_selected_aligned_protein.fasta"
    tidymsa <- tidy_msa(data)
    png("alignment_yopO.png", width=1100, height=192*15)
    msa_plot <- ggplot() +
    geom_msa(data = tidymsa, char_width = 0.5, seq_name = TRUE, show.legend = TRUE) + theme_msa() + facet_msa(50)
    msa_plot
    dev.off()
    
    # -- RERUN due to the one-letter-in-last-line Bug 
    #401 --> 9 --> 8
    data <- "yopB_sorted_selected_aligned_protein.fasta"
    tidymsa <- tidy_msa(data)
    png("alignment_yopB.png", width=1100, height=192*8)
    msa_plot <- ggplot() +
    geom_msa(data = tidymsa, char_width = 0.5, seq_name = TRUE, show.legend = TRUE) + theme_msa() + facet_msa(51)
    msa_plot
    dev.off()
    
    # -- RERUN due to Error in tidy_msa(data) : Sequences must have unique names --
    #322 --> 7 --> delete the repeated Yersinia_pestis_D182038 --> merge the two partial CDS into one
    data <- "yopT_sorted_selected_aligned_protein.fasta"
    tidymsa <- tidy_msa(data)
    png("alignment_yopT.png", width=1100, height=192*8)
    msa_plot <- ggplot() +
    geom_msa(data = tidymsa, char_width = 0.5, seq_name = TRUE, show.legend = TRUE) + theme_msa() + facet_msa(50)
    msa_plot
    dev.off()
    
    #468 --> 10 --> delete the repeated Yersinia_enterocolitica_YE6
    data <- "yopH_sorted_selected_aligned_protein.fasta"
    tidymsa <- tidy_msa(data)
    png("alignment_yopH.png", width=1100, height=192*10)
    msa_plot <- ggplot() +
    geom_msa(data = tidymsa, char_width = 0.5, seq_name = TRUE, show.legend = TRUE) + theme_msa() + facet_msa(50)
    msa_plot
    dev.off()
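The `#219 --> 5` style comments above give the alignment length and the number of rows that result from `facet_msa(50)`, and most PNG heights are set to roughly 192 pixels per row. The yopB rerun uses `facet_msa(51)` because 401 columns at 50 per row would leave a single residue on a ninth row. A small sketch of that arithmetic, where the 192-pixel value is read off the `height=192*N` arguments and the function names are hypothetical:

```python
import math


def facet_rows(alignment_length: int, columns_per_row: int = 50) -> int:
    """Number of wrapped rows needed to show an alignment of the given length."""
    return math.ceil(alignment_length / columns_per_row)


def png_height(alignment_length: int, columns_per_row: int = 50,
               pixels_per_row: int = 192) -> int:
    """Suggested PNG height in pixels, assuming a fixed height per wrapped row."""
    return pixels_per_row * facet_rows(alignment_length, columns_per_row)


if __name__ == "__main__":
    lengths = {"yopE": 219, "yopJ": 288, "yopD": 306, "yopM": 529, "yopK": 182,
               "yopO": 732, "yopB": 401, "yopT": 322, "yopH": 468}
    for name, length in lengths.items():
        print(name, facet_rows(length), png_height(length))
    # yopB: widening each row to 51 columns removes the one-residue last row.
    print("yopB with 51 columns per row:", facet_rows(401, 51))
```

The heights used in the R calls above do not always equal 192 times the row count (yopD, yopM and yopT use one extra or one fewer row), so this is only a starting value to adjust by eye.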
  10. BLAST search and Mauve analysis (Mauve should be opened under bengal3_ac3)

    makeblastdb -in Yersinia_pestis_790.fna -dbtype nucl
    blastn -query yopJ_WA.fasta  -db Yersinia_pestis_790.fna -out yopJ_WA_on_790.txt
    blastn -query yopO_WA.fasta  -db Yersinia_pestis_790.fna -out yopO_WA_on_790.txt