gene_x 0 like s 276 view s
Tags: pipeline
Bioinformatics Methods
Sequence Extraction and Comparative Analysis
Genomic regions surrounding the multidrug efflux MFS transporter were extracted from five isolates of Acinetobacter baumannii: ATCC19606, ACICU, AYE, ATCC17978, and SDF. These regions were aligned to identify any unmapped segments. This alignment revealed a 45 kb sequence (positions 2,944,541 to 2,989,666 in ATCC17978) that did not align with any part of the other four genomes. This unmapped sequence was subsequently extracted for further analysis.
Insertion Sequence (IS) Element Detection
The tool ISEScan (Xie & Tang, 2017) was employed to detect insertion sequences (IS) within the extracted 45 kb fragment. The analysis identified four distinct IS elements, providing insights into potential genomic variations and mobile genetic elements within this segment (see IS_of_45kb.xlsx).
Visualization and Annotation
The alignment was visualized using clinker v0.0.21 (Gilchrist and Chooi, 2021). Genes are represented by arrows colored according to similarity groups, with grey arrows indicating genes not part of any similarity group. The sequence identity between genes in the same similarity group is indicated by shading, according to the identity scale bar. Detailed gene annotations are shown and color-coded, with the multidrug efflux MFS transporter loci indicated by a green dashed box (see clinker1.png).
References
Tools of Insertion Sequence (IS) Element Detection
ISEScan: Description: Although not a database, ISEScan is a software tool used to identify IS elements in bacterial genome sequences. It can be helpful for researchers looking to analyze newly sequenced genomes for the presence of IS elements. Website: Available on platforms like GitHub for download and integration into bioinformatics workflows.
TnCentral including ISFinder: Description: TnCentral is a more comprehensive resource that includes information about transposons, which are larger and more complex than simple IS elements but often contain IS sequences as part of their structure. This database provides detailed information about transposon structures, including associated genes and regulatory features.
ISsaga: ISsaga is a web-based tool for the identification and annotation of insertion sequences in prokaryotic genomes. It provides various features for IS element analysis, including detection, classification, and visualization. You can access ISsaga here: ISsaga #http://issaga.biotoul.fr/ISsaga/issaga_index.php
ISFinder: ISFinder is a curated database and analysis platform for insertion sequences in prokaryotic genomes. It provides a comprehensive collection of IS sequences and tools for sequence analysis, classification, and annotation. You can access ISFinder here: ISFinder For ISfinder please cite: Siguier P. et al. (2006) ISfinder: the reference centre for bacterial insertion sequences. Nucleic Acids Res. 34: D32-D36 (link pubmed) and the database URL (http://www-is.biotoul.fr).
For ISbrowser please cite: Kichenaradja P. et al. (2010) ISbrowser: an extension of ISfinder for visualizing insertion sequences in prokaryotic genomes. Nucleic Acids Res. 38: D62-D68 (link pubmed). For ISsaga please cite: Varani A. et al. (2011) ISsaga: an ensemble of web-based methods for high throughput identification and semi-automatic annotation of insertion sequences in prokaryotic genomes, Genome Biology 2011, 12:R30 (link pubmed).
ISMapper: ISMapper is a tool for mapping insertion sequences in bacterial genomes. It uses paired-end sequence data to identify IS element insertion sites and provides information about their genomic context. You can access ISMapper here: ISMapper ISMapper: identifying transposase insertion sites in bacterial genomes from short read sequence data https://pubmed.ncbi.nlm.nih.gov/26336060/
ISseeker: ISseeker is a software package for the identification and annotation of insertion sequences in bacterial genomes. It provides a user-friendly interface for IS element detection and characterization. You can access ISseeker here: ISseeker
Identificatio of IS (Insertion Sequence) elements using ISEScan
#extracted sequence segments from the two isolates, specifically:
# ATCC19606: 930469 to 951674 — segment1
# ATCC17978: 2,934,384 to 3,000,721 — segment2
#Then, I compared the two segments and found that positions 1-11055 of segment1 mapped to 66338-55284 of segment2, and positions 11049-21206 of segment1 mapped to 10158-23 of segment2. This means the sequence from 10159-55283 of segment2 (about 45 kb nt) is not mapped. I then extracted the 45 kb sequence (see the attached fasta file). I attempted to detect IS elements using the tool ISEScan (https://academic.oup.com/bioinformatics/article/33/21/3340/3930124). Four ISs were detected (see 45kb.fasta.xlsx; for more detailed results, see 45kb.fasta.zip).
#samtools faidx Acinetobacter_baumannii_ATCC19606.gbk_converted.fna CP059040.1:930469-951674 > ../ATCC19606_segment.fasta
vim ./gbks/A.baumannii_ATCC17978.gbk
#LOCUS CP000521 3976747 bp DNA circular BCT 31-JAN-2014
#DEFINITION Acinetobacter baumannii ATCC 17978, complete genome.
I used the following commands extracted a 45kb fasta. Then using a tools get IS elements.
samtools faidx A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2934384-3000721 > ../ATCC17978_segment.fasta
makeblastdb -in ATCC17978_segment.fasta -dbtype nucl
blastn -db ATCC17978_segment.fasta -query ATCC19606_segment.fasta -num_threads 15 -outfmt 6 -strand both -evalue 0.1 > ATCC19606_segment_on_ATCC17978_segment.blastn
samtools faidx ATCC17978_segment.fasta CP000521.1_2934384_3000721:10159-55283 > 45kb.fasta
please update the following tables in which all positons referred to the 45kb sequence to the complete genome in ATCC17978.
#seqID: sequence identifier
#family: family name of IS element
#cluster: Tpase cluster
#isBegin and isEnd: genome coordinates of the predicted IS element
#isLen: length of the predicted IS element
#ncopy4is: number of predicted IS copies including full-length and partial IS copies
#start1, end1, start2, end2: genome coordinates of the IRs
#score: score of the IRs
#irId: number of identical matches in pairwise alignment of left and righ hand invered repeats
#irLen, length of inverted repeats
#nGaps: number of gaps in IRs
#orfBegin, orfEnd: genome coordinates of the predicted Tpase ORF
#strand: strand where the Tpase is
#orfLen: length of predicted Tpase ORF
#E-value: the best E-value among all IS copies for the same IS element, the smaller the better
#E-value4copy: the E-value of the reported IS copy, the smaller the better
#type: type of IS element copy, 'c' for complete IS element and 'p' for partial IS element
#ov: ov number returned by hmmer search
#tir: terminal inverted repeat sequences
seqID family cluster isBegin isEnd isLen ncopy4is start1 end1 start2 end2 score irId irLen nGaps orfBegin orfEnd strand orfLen E-value E-value4copy type ov tir
CP000521.1_2934384_3000721:10159-55283 IS5 IS5_222 5818 8737 2920 1 5818 5842 8713 8737 18 17 25 0 5931 8822 + 2892 3.3E-74 3.3E-74 c 1 TGATTAAACTTTGCGGATTAAATTG:TGATTAAATCTAATGTGTTGAATTG
CP000521.1_2934384_3000721:10159-55283 IS3 IS3_176 8745 9849 1105 1 8745 8761 9833 9849 26 15 17 0 8916 9775 - 860 9E-38 9E-38 p 1 ATTGATGATAGCCAAAA:ATTGATCCTAGCCAAAA
CP000521.1_2934384_3000721:10159-55283 IS5 IS5_226 9983 10411 429 1 9983 9996 10398 10411 20 12 14 0 9850 10364 + 515 7.2E-28 7.2E-28 p 1 TATCATTCATTATA:TATCATTCAGCATA
CP000521.1_2934384_3000721:10159-55283 IS5 IS5_302 23918 24796 879 1 23918 23953 24762 24796 54 33 36 1 23947 24699 - 753 3E-82 3E-82 c 1 AAAATCAAAATAATGCTTAGGGCGTGTCCTCATTTG:AAAATCAAAATGATGC-TAGGGCGTGTCTTCATTTG
Postprocessing of the relative postions in the 45kb.fasta in the results IS_of_45kb.xlsx to the absolute positions of the genome
#https://github.com/xiezhq/ISEScan?tab=readme-ov-file#Usage
Columns in NC_012624.fna.csv (NC_012624.fna.tsv, NC_012624.fna.raw):
#seqID: sequence identifier
#family: family name of IS element
#cluster: Tpase cluster
#isBegin and isEnd: genome coordinates of the predicted IS element
#isLen: length of the predicted IS element
#ncopy4is: number of predicted IS copies including full-length and partial IS copies
#start1, end1, start2, end2: genome coordinates of the IRs
#score: score of the IRs
#irId: number of identical matches in pairwise alignment of left and righ hand invered repeats
#irLen, length of inverted repeats
#nGaps: number of gaps in IRs
#orfBegin, orfEnd: genome coordinates of the predicted Tpase ORF
#strand: strand where the Tpase is
#orfLen: length of predicted Tpase ORF
#E-value: the best E-value among all IS copies for the same IS element, the smaller the better
#E-value4copy: the E-value of the reported IS copy, the smaller the better
#type: type of IS element copy, 'c' for complete IS element and 'p' for partial IS element
#ov: ov number returned by hmmer search
#tir: terminal inverted repeat sequences
#Note: the E-value is the E-value returned by hmmer when searching profile HMMs against proteome translated from a genome sequence
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:5818-8737 > 1_1.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:5818-5842 > 1_2.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8713-8737 > 1_3.fasta
seqtk seq -r 1_3.fasta > 1_3_rc.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:5931-8822 > 1_4.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8745-9849 > 2_1.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8745-8761 > 2_2.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9833-9849 > 2_3.fasta
seqtk seq -r 2_3.fasta > 2_3_rc.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:8916-9775 > 2_4.fasta
seqtk seq -r 2_4.fasta > 2_4_rc.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9983-10411 > 3_1.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9983-9996 > 3_2.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:10398-10411 > 3_3.fasta
seqtk seq -r 3_3.fasta > 3_3_rc.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:9850-10364 > 3_4.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:23918-24796 > 4_1.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:23918-23953 > 4_2.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:24762-24796 > 4_3.fasta
seqtk seq -r 4_3.fasta > 4_3_rc.fasta
samtools faidx 45kb.fasta CP000521.1_2934384_3000721:10159-55283:23947-24699 > 4_4.fasta
seqtk seq -r 4_4.fasta > 4_4_rc.fasta
ATG (77 %), GTG (14 %), TTG
Stopcodons in der DNA:
TAG (Thymin - Adenin - Guanin)
TAA (Thymin - Adenin - Adenin)
TGA (Thymin - Guanin - Adenin)
CP000521.1_2934384_3000721:10159-55283 IS5 IS5_222 5818 8737 2920 1 5818 5842 8713 8737 18 17 25 0 5931 8822
CP000521.1_2934384_3000721:10159-55283 IS3 IS3_176 8745 9849 1105 1 8745 8761 9833 9849 26 15 17 0 8916 9775
CP000521.1_2934384_3000721:10159-55283 IS5 IS5_226 9983 10411 429 1 9983 9996 10398 10411 20 12 14 0 9850 10364
CP000521.1_2934384_3000721:10159-55283 IS5 IS5_302 23918 24796 879 1 23918 23953 24762 24796 54 33 36 1 23947 24699
seqID family cluster isBegin isEnd isLen ncopy4is start1 end1 start2 end2 score irId irLen nGaps orfBegin orfEnd strand orfLen E-value E-value4copy type ov tir
CP000521.1 IS5 IS5_222 2,950,360 2,953,279 2,920 1 2,950,360 2,950,388 2,953,239 2,953,267 18 17 25 0 2,950,473 2,953,364
+ 2,892 3.3E-74 3.3E-74 c 1 TGATTAAACTTTGCGGATTAAATTG
CP000521.1 IS3 IS3_176 2,953,287 2,954,391 1,105 1 2,953,287 2,953,303 2,954,359 2,954,375 26 15 17 0 2,953,458 2,954,317
- 860 9E-38 9E-38 p 1 ATTGATGATAGCCAAAA
CP000521.1 IS5 IS5_226 2,954,509 2,954,937 429 1 2,954,509 2,954,522 2,954,924 2,954,937 20 12 14 0 2,954,376 2,954,890
+ 515 7.2E-28 7.2E-28 p 1 TATCATTCATTATA
CP000521.1 IS5 IS5_302 2,969,444 2,970,322 879 1 2,969,444 2,969,479 2,970,248 2,970,282 54 33 36 1 2,969,473 2,970,225
- 753 3E-82 3E-82 c 1 AAAATCAAAATAATGCTTAGGGCGTGTCCTCATTTG
I checked the results of the following command, they should be TGATTAAACTTTGCGGATTAAATTG, However it is a little different.
samtools faidx A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950360-2950388
GATTAAACTTTGCGGATTAAATTGACAGA
2950359-5818=+2944541
1..45125
2944541..2989666
#confirm the position-changed sheet!
CP000521.1 IS5 IS5_222 2950359 2953278 2920 1 2950359 2950383 2953254 2953278 18 17 25 0 2950472 2953363
CP000521.1 IS3 IS3_176 2953286 2954390 1105 1 2953286 2953302 2954374 2954390 26 15 17 0 2953457 2954316
CP000521.1 IS5 IS5_226 2954524 2954952 429 1 2954524 2954537 2954939 2954952 20 12 14 0 2954391 2954905
CP000521.1 IS5 IS5_302 2968459 2969337 879 1 2968459 2968494 2969303 2969337 54 33 36 1 2968488 2969240
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950359-2953278 > 1_1_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950359-2950383 > 1_2_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953254-2953278 > 1_3_changed.fasta
seqtk seq -r 1_3_changed.fasta > 1_3_rc_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2950472-2953363 > 1_4_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953286-2954390 > 2_1_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953286-2953302 > 2_2_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954374-2954390 > 2_3_changed.fasta
seqtk seq -r 2_3_changed.fasta > 2_3_rc_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2953457-2954316 > 2_4_changed.fasta
seqtk seq -r 2_4_changed.fasta > 2_4_rc_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954524-2954952 > 3_1_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954524-2954537 > 3_2_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954939-2954952 > 3_3_changed.fasta
seqtk seq -r 3_3_changed.fasta > 3_3_rc_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2954391-2954905 > 3_4_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2968459-2969337 > 4_1_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2968459-2968494 > 4_2_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2969303-2969337 > 4_3_changed.fasta
seqtk seq -r 4_3_changed.fasta > 4_3_rc_changed.fasta
samtools faidx Easyfig_files3/A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2968488-2969240 > 4_4_changed.fasta
seqtk seq -r 4_4_changed.fasta > 4_4_rc_changed.fasta
diff 1_1.fasta 1_1_changed.fasta
diff 1_2.fasta 1_2_changed.fasta
diff 1_3.fasta 1_3_changed.fasta
diff 1_4.fasta 1_4_changed.fasta
diff 2_1.fasta 2_1_changed.fasta
diff 2_2.fasta 2_2_changed.fasta
diff 2_3.fasta 2_3_changed.fasta
diff 2_4.fasta 2_4_changed.fasta
diff 3_1.fasta 3_1_changed.fasta
diff 3_2.fasta 3_2_changed.fasta
diff 3_3.fasta 3_3_changed.fasta
diff 3_4.fasta 3_4_changed.fasta
diff 4_1.fasta 4_1_changed.fasta
diff 4_2.fasta 4_2_changed.fasta
diff 4_3.fasta 4_3_changed.fasta
diff 4_4.fasta 4_4_changed.fasta
It should be ATG.... (gene length)
The positions are not correct. Note that I changed the coordinate twice in the following commands:
samtools faidx A.baumannii_ATCC17978.gbk_converted.fna CP000521.1:2934384-3000721 > ../ATCC17978_segment.fasta
makeblastdb -in ATCC17978_segment.fasta -dbtype nucl
blastn -db ATCC17978_segment.fasta -query ATCC19606_segment.fasta -num_threads 15 -outfmt 6 -strand both -evalue 0.1 > ATCC19606_segment_on_ATCC17978_segment.blastn
samtools faidx ATCC17978_segment.fasta CP000521.1_2934384_3000721:10159-55283 > 45kb.fasta
The postions in the original tables are relative postions in the 45kb.fasta. Please correct it!
我想比对的是这27,694 Acinetobacter calcoaceticus/baumannii complex Genomes (https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=909768) 中,目的基因 (https://www.ncbi.nlm.nih.gov/nuccore/NC_010410.1?from=974325&to=975530) 是否是conserved的 (先不管SNP, InDel或其他突变现象)。就如这两篇paper的结果显示"AdeIJK is highly conserved across the Acinetobacter genus" (Paper1: Evolution of RND efflux pumps in the development of a successful pathogen; Paper2: RND pumps across the genus Acinetobacter: AdeIJK is the universal efflux pump). Analysis results: In our analysis of 38,638 records, we discovered that 65 of them do not contain the specific genes based on the submitted genomes from GenBank (refer to the details below). Attached, you will find a document that provides detailed information on the positions of the genes present in the remaining genomes. However, it's important to note that, based on my experience, GenBank does contain a number of erroneously assembled genomes. Therefore, it's conceivable that the absence of the gene in some isolates could be attributed not to its actual absence but rather to errors in genome assembly.
~/Tools/datasets download genome taxon 909768
#ncbi_dataset_taxon909768
~/Tools/datasets download genome accession --inputfile sampled_accession_numbers.txt --include gff3,gbff,genome
#./blastn_1st_batch.sh generating ../blastn_1st_res
for sample in GCF_000757665.1_ASM75766v1_genomic.fna GCF_000746645.1_ASM74664v1_genomic.fna; do
echo "makeblastdb -in ${sample} -dbtype nucl"
echo "blastn -db ${sample} -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> ../blastn_merged_res_1_1000"
done
for sample in GCF_000757665.1_ASM75766v1_genomic.fna GCF_000746645.1_ASM74664v1_genomic.fna; do
echo "makeblastdb -in ${sample} -dbtype nucl"
echo "blastn -db ${sample} -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> blastn_merged_res_4001_8000"
done
...
TODO NEXT WEEK: senden a EXCEL storing the merged blastn results! look if all 38638 records contain gene, if not, mark yellow (explain it is also possible an error of assembly of the genome regarding the specific isolate!)
#GenBank GCA_000810025.3
38638
makeblastdb -in GCF_000757665.1_ASM75766v1_genomic.fna -dbtype nucl
blastn -db GCF_000757665.1_ASM75766v1_genomic.fna -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> blastn_merged_res
makeblastdb -in GCF_000746645.1_ASM74664v1_genomic.fna -dbtype nucl
blastn -db GCF_000746645.1_ASM74664v1_genomic.fna -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> blastn_merged_res
#find ./ncbi_dataset/sampled_100/ -name "*.fna" -print0 | xargs -0 mv -t /destination/directory/
find . -mindepth 2 -maxdepth 2 -type f -name "*.fna" -print0 | xargs -0 mv -t .
find . -mindepth 1 -maxdepth 1 -type f -name "*.fna" >../fna_list2
# The genome was downloaded at Mär 25, 2024
I have over 38638 genome with the taxon-id 909768, want to assembly 100 from them,
the data strcture as the following:
./data/GCF_000757665.1/GCF_000757665.1_ASM75766v1_genomic.fna
./data/GCF_000746645.1/GCF_000746645.1_ASM74664v1_genomic.fna
./data/GCF_000746605.1/GCF_000746605.1_ASM74660v1_genomic.fna
./data/GCF_000738845.1/GCF_000738845.1_ASM73884v1_genomic.fna
./data/GCF_000734775.1/GCF_000734775.1_ASM73477v1_genomic.fna
./data/GCF_000731965.1/GCF_000731965.1_ASM73196v1_genomic.fna
./data/GCF_000708795.1/GCF_000708795.1_ABU310_genomic.fna
./data/GCF_000708775.1/GCF_000708775.1_ASM70877v1_genomic.fna
./data/GCF_000695855.3/GCF_000695855.3_ASM69585v3_genomic.fna
./data/GCF_000248195.1/GCF_000248195.1_ASM24819v2_genomic.fna
....
I have a total of 38638 genomes in fasta format in local computer. I want to quick check if they are contain a gene ABAYE_RS05070 (/product="multidrug effflux MFS transporter") https://www.ncbi.nlm.nih.gov/nuccore/NC_010410.1?from=974325&to=975530)
# What are the default values of -perc_identity and -qcov_hsp_perc in blastn?
#-perc_identity 90 -qcov_hsp_perc 90
By default:
-perc_identity is set to 0, meaning that all alignments found will be reported regardless of percent identity.
-qcov_hsp_perc is set to 0 as well, indicating that all alignments will be reported regardless of query coverage per HSP.
makeblastdb -in ncbi_dataset/data/GCF_904885835.1/GCF_904885835.1_Aci00717-181015_contigs_filtered.fa.gz_genomic.fna -dbtype nucl
blastn -db ncbi_dataset/data/GCF_904885835.1/GCF_904885835.1_Aci00717-181015_contigs_filtered.fa.gz_genomic.fna -query ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1
makeblastdb -in 100.fasta -dbtype nucl
# -max_target_seqs 1
blastn -db 100.fasta -query ../../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1
with the following commands:
for sample in GCF_000809205.3_ASM80920v3_genomic.fna GCF_000809215.3_ASM80921v3_genomic.fna GCF_000809225.3_ASM80922v3_genomic.fna _genomic.fna
GCF_001907125.1_ASM190712v1_genomic.fna GCF_001907145.1_ASM190714v1_genomic.fna GCF_001907155.1_ASM190715v1_genomic.fna GCF_001908295.1_ASM190829v1_genomic.fna GCF_001909135.1_ASM190913v1_genomic.fna GCF_001910585.1_ASM191058v1_genomic.fna GCF_001910595.1_ASM191059v1_genomic.fna GCF_001910605.1_ASM191060v1_genomic.fna GCF_001910615.1_ASM191061v1_genomic.fna GCF_001910665.1_ASM191066v1_genomic.fna GCF_001910675.1_ASM191067v1_genomic.fna GCF_008632635.1_ASM863263v1_genomic.fna; do
makeblastdb -in ${sample} -dbtype nucl
blastn -db ${sample} -query ../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt 6 -strand both -max_target_seqs 1 >> ../blastn_1st_res
done
get the following list, I want to add the two parts in the filenames also in the results as the second and third columns.
For example for GCF_001907155.1_ASM190715v1_genomic.fna,
the current results is
"gi|169794206|ref|NC_010410.1|:974325-975530 NZ_JWTW03000099.1 97.595 1206 29 0 1 1206 211093 209888 0.0 2067"
It should be
"gi|169794206|ref|NC_010410.1|:974325-975530 GCF_001907155.1 ASM190715v1 NZ_JWTW03000099.1 97.595 1206 29 0 1 1206 211093 209888 0.0 2067"
gi|169794206|ref|NC_010410.1|:974325-975530 NZ_JWYU03000064.1 97.595 1206 29 0 1 1206 73028 74233 0.0 2067
gi|169794206|ref|NC_010410.1|:974325-975530 NZ_CP010397.1 97.181 1206 34 0 1 1206 983267 984472 0.0 2039
gi|169794206|ref|NC_010410.1|:974325-975530 NZ_HG977522.1 97.595 1206 29 0 1 1206 935195 936400 0.0 2067
gi|169794206|ref|NC_010410.1|:974325-975530 NZ_HG977526.1 97.595 1206 29 0 1 1206 934286 935491 0.0 2067
gi|169794206|ref|NC_010410.1|:974325-975530 NZ_AP014649.1 97.595 1206 29 0 1 1206 960072 961277 0.0 2067
gi|169794206|ref|NC_010410.1|:974325-975530 NZ_CP010781.1 100.000 1206 0 0 1 1206 2943042 2941837 0.0 2228
gi|169794206|ref|NC_010410.1|:974325-975530 NZ_BBSU01000014.1 98.093 1206 23 0 1 1206 30546 29341 0.0 2100
gi|169794206|ref|NC_010410.1|:974325-975530 NZ_BBSP01000001.1 97.430 1206 31 0 1 1206 511041 509836 0.0 2056
#!/bin/bash
> ../blastn_res
for sample in GCF_000809205.3_ASM80920v3_genomic.fna GCF_000809215.3_ASM80921v3_genomic.fna GCF_000809225.3_ASM80922v3_genomic.fna GCF_001907125.1_ASM190712v1_genomic.fna GCF_001907145.1_ASM190714v1_genomic.fna GCF_001907155.1_ASM190715v1_genomic.fna GCF_001908295.1_ASM190829v1_genomic.fna GCF_001909135.1_ASM190913v1_genomic.fna GCF_001910585.1_ASM191058v1_genomic.fna GCF_001910595.1_ASM191059v1_genomic.fna GCF_001910605.1_ASM191060v1_genomic.fna GCF_001910615.1_ASM191061v1_genomic.fna GCF_001910665.1_ASM191066v1_genomic.fna GCF_001910675.1_ASM191067v1_genomic.fna GCF_008632635.1_ASM863263v1_genomic.fna; do
# Extract the second and third parts from the filename
filename_parts=$(echo "${sample}" | cut -d'_' -f1-3)
# Create the blast database
makeblastdb -in "${sample}" -dbtype nucl
# Run blastn and append the results with filename parts
blastn -db "${sample}" -query ../ABAYE_RS05070.fasta -evalue 0.1 -num_threads 15 -outfmt '6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore' -strand both -max_target_seqs 1 | awk -v filename_parts="${filename_parts}" '{print $0 "\t" filename_parts}' >> ../blastn_res
done
#a specific assembly version of a genome in the NCBI (National Center for Biotechnology Information) database
#GCF_000248195.1_ASM24819v2 Acinetobacter baumannii strain NCTC 7422 contig_0002, whole genome shotgun sequence
DELETE, since they are all the same: 1. qseqid [query or source (gene) sequence id]
2. sseqid GenBank ID [subject or target (reference genome) sequence id]
3. pident Percentage of identical positions
4. length Alignment length (sequence overlap)
5. mismatch Number of mismatches
6. gapopen Number of gap openings
7. qstart Start of alignment in query
8. qend End of alignment in query
9. sstart Start of alignment in subject
10. send End of alignment in subject
11. evalue Expect value
12. bitscore Bit score
13. Assembly accession number
GenBank ID Percentage of identical positions Alignment length Number of mismatches Number of gap openings Start of alignment in query End of alignment in query Start of alignment in subject End of alignment in subject Expect value Bit score Assembly accession number
query_seq_id sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
gi|169794206|ref|NC_010410.1|:974325-975530 NZ_AIED01000002.1 82.047 1192 210 4 1 1190 161177 159988 0.0 1013 GCF_000248195.1_ASM24819v2
#batch 1: until ...802.2_genomic.fna GCA_032600445.2_PDT001925801.2_genomic.fna GCA_032600465.2_PDT001925800.2_genomic.fna GCA_032600485.2_PDT001925799.2_genomic.fna GCA_032600525.2_PDT001925797.2_genomic.fna GCA_032600545.2_PDT001925796.2_genomic.fna GCA_032600565.2_PDT001925795.2_genomic.fna GCA_032600585.2_PDT001925793.2_genomic.fna
#batch 2: from GCA_032600605.2_PDT001925792.2_genomic.fna GCA_032600625.2_PDT001925794.2_genomic.fna GCA_032600645.2_PDT001925791.2_genomic.fna ...
#to compare with:
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCE020000004.1 97.595 1206 29 0 1 1206 96245 95040 0.0 2067 GCA_032600385.2_PDT001925803.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCB020000004.1 97.595 1206 29 0 1 1206 96245 95040 0.0 2067 GCA_032600395.2_PDT001925804.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCD020000003.1 97.595 1206 29 0 1 1206 96245 95040 0.0 2067 GCA_032600425.2_PDT001925802.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCF020000001.1 97.595 1206 29 0 1 1206 210993 209788 0.0 2067 GCA_032600445.2_PDT001925801.2 <--
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCG020000001.1 97.595 1206 29 0 1 1206 210993 209788 0.0 2067 GCA_032600465.2_PDT001925800.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCH020000002.1 97.595 1206 29 0 1 1206 203882 202677 0.0 2067 GCA_032600485.2_PDT001925799.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCJ020000003.1 97.595 1206 29 0 1 1206 113990 112785 0.0 2067 GCA_032600525.2_PDT001925797.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCK020000002.1 97.595 1206 29 0 1 1206 210992 209787 0.0 2067 GCA_032600545.2_PDT001925796.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCM020000002.1 97.595 1206 29 0 1 1206 210993 209788 0.0 2067 GCA_032600565.2_PDT001925795.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCN020000002.1 97.595 1206 29 0 1 1206 210993 209788 0.0 2067 GCA_032600585.2_PDT001925793.2 <--
Assembly accession number GenBank accession number
Can I say NZ_AIED01000002.1 Genbank_ID, and GCF_000248195.1_ASM24819v2 Assembly_ID?
Yes, you can interpret these identifiers within the given line as follows:
* NZ_AIED01000002.1 can be referred to as a GenBank accession number. This format typically represents a sequence identifier within the GenBank database, part of the NCBI's collection of databases related to nucleotide sequences and their annotations.
* GCF_000248195.1_ASM24819v2 refers to an assembly accession number from the GenBank assembly database. The "GCF" prefix indicates a RefSeq genome assembly accession, which is part of NCBI's Reference Sequence Database. The number following "GCF" is a unique identifier for the specific assembly of the genome, and the information after the underscore provides a version identifier and possibly an internal project code or name.
In a broader context, the line you've provided looks like it's from a BLAST output or a similar comparative genomics tool output. Such lines typically contain information about sequence alignments, including the query and subject sequences (with their respective accession numbers), alignment scores, and statistical significance metrics. The presence of both a sequence accession number (GenBank ID) and an assembly accession number helps link individual sequences to the larger genomic context from which they were derived.
30662 + 7989 = 38651 > 38638
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCM020000002.1 97.595 1206 29 0 1 1206 210993 209788 0.0 2067 GCA_032600565.2_PDT001925795.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCN020000002.1 97.595 1206 29 0 1 1206 210993 209788 0.0 2067 GCA_032600585.2_PDT001925793.2
gi|169794206|ref|NC_010410.1|:974325-975530 ABNRCO020000003.1 97.595 1206 29 0 1 1206 54521 53316 0.0 2067 GCA_032600605.2_PDT001925792.2
# In the file containing the following lines, I want to extract the first tokens separated by '_'.
./GCF_009013175.1_ASM901317v1_genomic.fna
./GCF_013344775.1_ASM1334477v1_genomic.fna
./GCF_013344845.1_ASM1334484v1_v1_genomic.fna
./GCF_013346345.1_ASM1334634v1_genomic.fna
./GCF_013346385.1_ASM1334638v1_genomic.fna
38651-38638=13
diff fna_list_tokens_1_3 blastn_res_f13 > diff2
diff fna_list_tokens_1_3 blastn_res_f13_uniq > diff3
# Open the file named 'fna_list' and read its content
with open('fna_list', 'r') as file:
lines = file.readlines()
# Extract the first token after the first '_' in each line
first_tokens = [line.split('_')[1].strip() for line in lines]
# Example output
print(first_tokens)
点赞本文的读者
还没有人对此文章表态
没有评论
Transposon analyses for the nanopore sequencing
Updated List of nf-core Pipelines (Released) Sorted by Stars (as of November 22, 2024)
Variant Calling for Herpes Simplex Virus 1 from Patient Sample Using Capture Probe Sequencing
© 2023 XGenes.com Impressum