Directory Listings Summary

/media/jhuang/INTENSO

(empty; the data are now in ~/DATA_Intenso)

# Name
1 (empty)
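
The tables below were compiled from per-location ls runs (see "Original input" at the end of this document). A minimal sketch for regenerating a name-only table for any location, assuming the paths used in this summary (the LOCATIONS array is abridged to three examples):

#!/usr/bin/env bash
# Rebuild the "# Name" tables: one numbered name per line, oldest entry
# first, matching the ls -ltrh ordering in the raw listings below.
LOCATIONS=(
  "$HOME/DATA"
  "$HOME/DATA_Intenso"
  "/media/jhuang/Titisee"
)
for loc in "${LOCATIONS[@]}"; do
  printf '\n%s\n\n# Name\n' "$loc"
  ls -1tr "$loc" | nl -w1 -s' '
done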

~/DATA

# Name
1 Data_Ute_MKL1
2 Data_Ute_RNA_4_2022-11_test
3 Data_Ute_RNA_3
4 Data_Susanne_Carotis_RNASeq_PUBLISHING
5 Data_Jiline_Yersinia_SNP
6 Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex_DUPLICATED_DEL
7 Data_Nicole_CRC1648
8 Mouse_HS3ST1_12373_out
9 Mouse_HS3ST1_12175_out
10 Data_Biobakery
11 Data_Xiaobo_10x_2
12 Data_Xiaobo_10x_3
13 Talk_Nicole_CRC1648
14 Talks_Bioinformatics_Meeting
15 Talks_resources
16 Data_Susanne_MPox_DAMIAN
17 Data_host_transcriptional_response
18 Talks_including_DEEP-DV
19 DOKTORARBEIT
20 Data_Susanne_MPox
21 Data_Jiline_Transposon
22 Data_Jiline_Transposon2
23 Data_Matlab
24 deepseek-ai
25 Stick_Mi_DEL
26 TODO_shares
27 Data_Ute_RNA_4
28 Data_Liu_PCA_plot
29 README_run_viral-ngs_inside_Docker
30 README_compare_genomes
31 mapped.bam
32 Data_Serpapi
33 Data_Ute_RNA_1_2
34 Data_Marc_RNAseq_2024
35 Data_Nicole_CaptureProbeSequencing
36 LOG_mapping
37 Data_Huang_Human_herpesvirus_3
38 Data_Nicole_DAMIAN_Post-processing_Pathoprobe_FluB_Links
39 Access_to_Win7
40 Data_DAMIAN_Post-processing_Flavivirus_and_FSME_and_Haemophilus
41 Data_Luise_Sepi_STKN
42 Data_Patricia_Sepi_7samples
43 Data_Soeren_2025_PUBLISHING
44 Data_Ben_RNAseq_2025
45 Data_Tam_DNAseq_2025_AYE-WT_Q_S_craA-Tig4_craA-1-Cm200_craA-2-Cm200
46 Data_Patricia_Transposon
47 Data_Patricia_Transposon_2025
48 Colocation_Space
49 Data_Tam_Methylation_2025_empty
50 2025-11-03_eVB-Schreiben_12-57.pdf
51 DEGs_Group1_A1-A3+A8-A10_vs_Group2_B10-B16.png
52 README.pdf
53 Data_Hannes_JCM00612
54 167_redundant_DEL
55 Lehre_Bioinformatik
56 Data_Ben_Boruta_Analysis
57 Data_Childrensclinic_16S_2025_DEL
58 Data_Ben_Mycobacterium_pseudoscrofulaceum
59 Foong_RNA_mSystems_Huang_Changed.txt
60 Data_Pietro_Scatturo_and_Charlotte_Uetrecht_16S_2025
61 Data_JuliaBerger_RNASeq_SARS-CoV-2
62 Data_PaulBongarts_S.epidermidis_HDRNA
63 Data_Ute
64 Data_Foong_DNAseq_2025_AYE_Dark_vs_Light_TODO
65 Data_Foong_RNAseq_2021_ATCC19606_Cm
66 Data_Tam_Funding
67 Data_Tam_RNAseq_2025_LB-AB_IJ_W1_Y1_WT_vs_Mac-AB_IJ_W1_Y1_WT_on_ATCC19606
68 Data_Tam_RNAseq_2025_subMIC_exposure_on_ATCC19606
69 Data_Tam.txt
70 Data_Tam_RNAseq_2024_AUM_MHB_Urine_on_ATCC19606
71 Data_Tam_Metagenomics_2026
72 Data_Michelle
73 Data_Nicole_16S_2025_Childrensclinic
74 Data_Sophie_HDV_Sequences
75 Data_Tam_DNAseq_2026_19606deltaIJfluE
76 README_nf-core
77 Data_Vero_Kymographs
78 Access_to_Win10
79 Data_Patricia_AMRFinderPlus_2025
80 Data_Tam_DNAseq_2025_Unknown-adeABadeIJ_adeIJK_CM1_CM2
81 Data_Damian
82 Data_Karoline_16S
83 Data_JuliaFuchs_RNAseq_2025
84 Data_Tam_DNAseq_2025_ATCC19606-Y1Y2Y3Y4W1W2W3W4_TODO
85 Data_Tam_DNAseq_2026_Acinetobacter_harbinensis
86 Data_Benjamin_DNAseq_2026_GE11174
87 Data_Susanne_spatialRNA_2022.9.1_backup
88 Data_Susanne_spatialRNA
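
Many entries above and below carry a _DEL suffix (e.g. Stick_Mi_DEL, 167_redundant_DEL), marking them as deletion candidates. A sketch for collecting all flagged candidates across the home-directory locations before a cleanup pass, assuming the paths in this summary:

# List top-level deletion candidates flagged with the _DEL naming convention
# (also matches _DELETED and _DEL_AFTER_UPLOAD_GEO variants).
for loc in ~/DATA ~/DATA_A ~/DATA_B ~/DATA_C ~/DATA_D ~/DATA_E ~/DATA_Intenso; do
  find "$loc" -maxdepth 1 -name '*_DEL*'
done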

~/DATA_A

# Name
1 Data_Damian_NEW_CREATED
2 Data_R_bubbleplots
3 Data_Ute_TRANSFERED_DEL
4 Paper_Target_capture_sequencing_MHH_PUBLISHED
5 Data_Nicole8_Lamprecht_new_PUBLISHED
6 Data_Samira_RNAseq

~/DATA_B

# Name
1 Data_DAMIAN_endocarditis_encephalitis
2 Data_Denise_sT_PUBLISHING
3 Data_Fran2_16S_func
4 Data_Holger_5179-R1_vs_5179
5 Antraege_
6 Data_16S_Nicole_210222
7 Data_Adam_Influenza_A_virus
8 Data_Anna_Efaecium_assembly
9 Data_Bactopia
10 Data_Ben_RNAseq
11 Data_Johannes_PIV3
12 Data_Luise_Epidome_longitudinal_nose
13 Data_Manja_Hannes_Probedesign
14 Data_Marc_AD_PUBLISHING
15 Data_Marc_RNA-seq_Saureus_Review
16 Data_Nicole_16S
17 Data_Nicole_cfDNA_pathogens
18 Data_Ring_and_CSF_PegivirusC_DAMIAN
19 Data_Song_Microarray
20 Data_Susanne_Omnikron
21 Data_Viro
22 Doktorarbeit
23 Poster_Rohde_20230724
24 Data_Django
25 Data_Holger_S.epidermidis_1585_5179_HD05
26 Data_Manja_RNAseq_Organoids_Virus
27 Data_Holger_MT880870_MT880872_Annotation
28 Data_Soeren_RNA-seq_2022
29 Data_Manja_RNAseq_Organoids_Merged
30 Data_Gunnar_Yersiniomics
31 Data_Manja_RNAseq_Organoids
32 Data_Susanne_Carotis_MS

~/DATA_C

(names only, in the order returned by ls)

# Name
1 2022-10-27_IRI_manuscript_v03_JH.docx
2 16304905.fasta
3 '16S data manuscript_NF.docx'
4 180820_2_supp_4265595_sw6zjk.docx
5 180820_2_supp_4265596_sw6zjk.docx
6 1a_vs_3.csv
7 '2.05.01.05-A01 Urlaubsantrag-Shuting-beantragt.pdf'
8 2014SawickaBBA.pdf
9 20160509Manuscript_NDM_OXA_mitKomm.doc
10 220607_Agenda_monthly_meeting.pdf
11 '20221129 Table mutations.docx'
12 230602_NB501882_0428_AHKG53BGXT.zip
13 362383173.rar
14 562.9459.1.fa
15 562.9459.1_rc.fa
16 ASA3P.pdf
17 All_indels_annotated_vHR.xlsx
18 'Amplikon_indeces_Susanne +groups.xlsx'
19 Amplikon_indeces_Susanne.xlsx
20 GAMOLA2
21 Data_Susanne_Carotis_spatialRNA_PUBLISHING (dead link)
22 Data_Paul_Staphylococcus_epidermidis
23 Data_Nicola_Schaltenberg_PICRUSt
24 Data_Nicola_Schaltenberg
25 Data_Nicola_Gagliani
26 Data_methylome_MMc
27 Data_Jingang
28 Data_Indra_RNASeq_GSM2262901
29 Data_Holger_VRE
30 Data_Holger_Pseudomonas_aeruginosa_SNP
31 Data_Hannes_ChIPSeq
32 Data_Emilia_MeDIP
33 Data_ChristophFR_HepE_published
34 Data_Christopher_MeDIP_MMc_published
35 Data_Anna_Kieler_Sepi_Staemme
36 Data_Anna12_HAPDICS_final
37 Data_Anastasia_RNASeq_PUBLISHING
38 Aufnahmeantrag_komplett_10_2022.pdf
39 Astrovirus.pdf
40 COMMANDS
41 Bacterial_pipelines.txt
42 COMPSRA_uke_DEL.jar
43 ChIPSeq_pipeline_desc.docx
44 ChIPSeq_pipeline_desc.pdf
45 Comparative_genomic_analysis_of_eight_novel_haloal.pdf
46 CvO_Klassenliste_7_3.pdf
47 'Copy of pool_b1_CGATGT_300.xlsx'
48 Fran_16S_Exp8-17-21-27.txt
49 HPI_DRIVE
50 HEV_aligned.fasta
51 INTENSO_DIR
52 HPI_samples_for_NGS_29.09.22.xlsx
53 Hotmail_to_Gmail
54 Indra_Thesis_161020.pdf
55 'LT K331A.gbk'
56 LOG_p954_stat
57 LOG
58 Manuscript_10_02_2021.docx
59 Metagenomics_Tools_and_Insights.pdf
60 'Miseq Amplikon LAuf April.xlsx'
61 NGS.tar.gz
62 Nachweis_Bakterien_Viren_im_Hochdurchsatz.pdf
63 Nicole8_Lamprecht_logs
64 Nanopore.handouts.pdf
65 'Norovirus paper Susanne 191105.docx'
66 PhyloRNAalifold.pdf
67 README_R
68 README_RNAHiSwitch_DEL
69 RNA-NGS_Analysis_modul3_NanoStringNorm.zip
70 RNAConSLOptV1.2.tar.gz
71 'RSV GFP5 including 3`UTR.docx'
72 SNPs_on_pangenome.txt
73 SERVER
74 R_tutorials-master.zip
75 Rawdata_Readme.pdf
76 SUB10826945_record_preview.txt
77 S_staphylococcus_annotated_diff_expr.xls
78 Snakefile_list
79 Source_Classification_Code.rds
80 Supplementary_Table_S3.xlsx
81 Untitled.ipynb
82 UniproUGENE_UserManual.pdf
83 Untitled1.ipynb
84 Untitled2.ipynb
85 Untitled3.ipynb
86 WAC6h_vs_WAP6h_down.txt
87 damian_nodbs
88 WAC6h_vs_WAP6h_up.txt
89 'add. Figures Hamburg_UKE.pptx'
90 all_gene_counts_with_annotation.xlsx
91 app_flask.py
92 bengal-bay-0.1.json
93 bengal3_ac3.yml
94 call_shell_from_Ruby.png
95 bengal3_ac3_.yml
96 empty.fasta
97 coefficients_csaw_vs_diffreps.xlsx
98 exchange.txt
99 exdata-data-NEI_data.zip
100 genes_wac6_wap6.xls
101 go1.13.linux-amd64.tar.gz.1
102 hev_p2-p5.fa
103 map_corrected_backup.txt
104 install_nginx_on_hamm
105 hg19.rmsk.bed
106 metadata-9563675-processed-ok.tsv
107 mkg_sprechstundenflyer_ver1b_dezember_2019.pdf
108 multiqc_config.yaml
109 p11326_OMIKRON3398_corsurv.gb
110 p11326_OMIKRON3398_corsurv.gb_converted.fna
111 parseGenbank_reformat.py
112 pangenome-snakemake-master.zip
113 'phylo tree draft.pdf'
114 qiime_params.txt
115 pool_b1_CGATGT_300.zip
116 qiime_params_backup.txt
117 qiime_params_s16_s18.txt
118 snakePipes
119 results_description.html
120 rnaalihishapes.tar.gz
121 rnaseq_length_bias.pdf
122 3932-Leber
123 BioPython
124 Biopython
125 DEEP-DV
126 DOKTORARBEIT
127 Data_16S_Arck_vaginal_stool
128 Data_16S_BS052
129 Data_16S_Birgit
130 Data_16S_Christner
131 Data_16S_Leonie
132 Data_16S_PatientA-G_CSF
133 Data_16S_Schaltenberg
134 Data_16S_benchmark
135 Data_16S_benchmark2
136 Data_16S_gcdh_BKV
137 Data_Alex1_Amplicon
138 Data_Alex1_SNP
139 Data_Analysis_for_Life_Science
140 Data_Anna13_vanA-Element
141 Data_Anna14_PACBIO_methylation
142 Data_Anna_C.acnes2_old_DEL
143 Data_Anna_MT880872_update
144 Data_Anna_gap_filling_agrC
145 Data_Baechlein_Hepacivirus_2018
146 Data_Bornavirus
147 Data_CSF
148 Data_Christine_cz19-178-rothirsch-bovines-hepacivirus
149 Data_Daniela_adenovirus_WGS
150 Data_Emilia_MeDIP_DEL
151 Data_Francesco2021_16S
152 Data_Francesco2021_16S_re
153 Data_Gunnar_MS
154 Data_Hannes_RNASeq
155 Data_Holger_Efaecium_variants_PUBLISHED
156 Data_Holger_VRE_DEL
157 Data_Icebear_Damian
158 Data_Indra3_H3K4_p2_DEL
159 Data_Indra6_RNASeq_ChipSeq_Integration_DEL
160 Data_Indra_Figures
161 Data_KatjaGiersch_new_HDV
162 Data_MHH_Encephalitits_DAMIAN
163 Data_Manja_RPAChIPSeq_public
164 Data_Manuel_WGS_Yersinia
165 Data_Manuel_WGS_Yersinia2_DEL
166 Data_Manuel_WGS_Yersinia_DEL
167 Data_Marcus_tracrRNA_structures
168 Data_Mausmaki_Damian
169 Data_Nicole1_Tropheryma_whipplei
170 Data_Nicole5
171 Data_Nicole5_77-92
172 Data_PaulBecher_Rotavirus
173 Data_Pietschmann_HCV_Amplicon_bigFile
174 Data_Piscine_Orthoreovirus_3_in_Brown_Trout
175 Data_Proteomics
176 Data_RNABioinformatics
177 Data_RNAKinetics
178 Data_R_courses
179 Data_SARS-CoV-2
180 Data_SARS-CoV-2_Genome_Announcement_PUBLISHED
181 Data_Seite
182 Data_Song_aggregate_sum
183 Data_Susanne_Amplicon_RdRp_orf1_2_re
184 Data_Tabea_RNASeq
185 Data_Thaiss1_Microarray_new
186 Data_Tintelnot_16S
187 Data_Wuenee_Plots
188 Data_Yang_Poster
189 Data_jupnote
190 Data_parainfluenza
191 Data_snakemake_recipe
192 Data_temp
193 Data_viGEN
194 Genomic_Data_Science
195 Learn_UGENE
196 MMcPaper
197 Manuscript_Epigenetics_Macrophage_Yersinia
198 Manuscript_RNAHiSwitch
199 MeDIP_Emilia_copy_DEL
200 Method_biopython
201 NGS
202 Okazaki-Seq_Processing
203 RNA-NGS_Analysis_modul3_NanoStringNorm
204 RNAConSLOptV1.2
205 RNAHeliCes
206 RNA_li_HeliCes
207 RNAliHeliCes
208 RNAliHeliCes_Relatedshapes_modified
209 R_refcard
210 R_DataCamp
211 R_cats_package
212 R_tutorials-master
213 SnakeChunks
214 align_4l_on_FJ705359
215 align_4p_on_FJ705359
216 assembly
217 bacto
218 bam2fastq_mapping_again
219 chipster
220 damian_GUI
221 enhancer-snakemake-demo
222 hg19_gene_annotations
223 interlab_comparison_DEL
224 my_flask
225 papers
226 pangenome-snakemake_zhaoc1
227 pyflow-epilogos
228 raw_data_rnaseq_Indra
229 test_raw_data_dnaseq
230 test_raw_data_rnaseq
231 to_Francesco
232 ukepipe
233 ukepipe_nf
234 var_www_DjangoApp_mysite2_2023-05
235 roentgenpass.pdf
236 salmon_tx2gene_GRCh38.tsv
237 salmon_tx2gene_chrHsv1.tsv
238 'sample IDs_Lamprecht.xlsx'
239 summarySCC_PM25.rds
240 untitled.py
241 tutorial-rnaseq.pdf
242 x.log
243 webapp.tar.gz
244 temp
245 temp2
246 Data_Susanne_Amplicon_haplotype_analyses_RdRp_orf1_2_re
247 Data_Susanne_WGS_unbiased
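
Entry 21 above is flagged as a dead link. A sketch for finding broken symlinks in a location before archiving it, assuming GNU find:

# -xtype l matches symlinks whose target no longer exists.
find ~/DATA_C -maxdepth 1 -xtype l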

~/DATA_D

# Name
1 Data_Soeren_RNA-seq_2023_PUBLISHING
2 Data_Ute
3 Data_Marc_RNA-seq_Sepidermidis
4 Data_Patricia_Transposon
5 Books_DA_for_Life
6 Data_Sven
7 Datasize_calculation_based_on_coverage.txt
8 Data_Paul_HD46_1-wt_resequencing
9 Data_Sanam_DAMIAN
10 Data_Tam_variant_calling
11 Data_Samira_Manuscripts
12 Data_Silvia_VoltRon_Debug
13 Data_Pietschmann_229ECoronavirus_Mutations_2024
14 Data_Pietschmann_229ECoronavirus_Mutations_2025
15 Data_Birthe_Svenja_RSV_Probe3_PUBLISHING

~/DATA_E

# Name
1 j_huang_until_201904
2 Data_2019_April
3 Data_2019_May
4 Data_2019_June
5 Data_2019_July
6 Data_2019_August
7 Data_2019_September
8 Data_Song_RNASeq_PUBLISHED
9 Data_Laura_MP_RNASeq
10 Data_Nicole6_HEV_Swantje2
11 Data_Becher_Damian_Picornavirus_BovHepV
12 bacteria_refseq.zip
13 bacteria_refseq
14 Data_Rotavirus
15 Data_Xiaobo_10x
16 Data_Becher_Damian_Picornavirus_BovHepV_INCOMPLETE_DEL

~/DATA_Intenso

# Name
1 HOME_FREIBURG_DEL
2 150810_M03701_0019_000000000-AFJFK
3 Data_Thaiss2_Microarray
4 VirtualBox_VMs_DEL
5 'VirtualBox VMs_DEL'
6 'VirtualBox VMs2_DEL'
7 websites
8 DATA
9 Data_Laura
10 Data_Laura_2
11 Data_Laura_3
12 galaxy_tools
13 Downloads2
14 Downloads
15 mom-baby_com_cn
16 'VirtualBox VMs2'
17 VirtualBox_VMs
18 CLC_Data
19 Work_Dir2
20 Work_Dir2_SGE
21 Data_SPANDx1_Kpneumoniae_vs_Assembly1
22 MauveOutput
23 Fastqs
24 Data_Anna3_VRE_Ausbruch
25 Work_Dir_mock_broad_mockinput
26 Work_Dir_dM_broad_mockinput
27 Data_Anna8_RNASeq_static_shake_deprecated
28 PENDRIVE_cont
29 Work_Dir_WAP_broad_mockinput
30 Work_Dir_WAC_broad_mockinput
31 Work_Dir_dP_broad_mockinput
32 Data_Nicole10_16S_interlab
33 PAPERS
34 TB
35 Data_Anna4_SNP
36 Data_Carolin1_16S
37 ChipSeq_Raw_Data3_171009_NB501882_0024_AHNGTYBGX3
38 m_aepfelbacher_DEL.zip
39 Data_Anna7_RNASeq_Cytoscape
40 Data_Nicole9_Hund_Katze_Mega
41 Data_Anna2_CO6114
42 Data_Nicole3_TH17_orig
43 Data_Nicole1_Tropheryma_whipplei
44 results_K27
45 'VirtualBox VMs'
46 Data_Anna6_RNASeq
47 Data_Anna1_1585_RNAseq
48 Data_Thaiss1_Microarray
49 Data_Nicole7_Anelloviruses_Polyomavirus
50 Data_Nina1_Nicole5_1-76
51 Data_Nina1_merged
52 Data_Nicole8_Lamprecht
53 Data_Anna5_SNP
54 chipseq
55 Downloads_DEL
56 Data_Gagliani2_enriched_16S
57 Data_Gagliani1_18S_16S
58 m_aepfelbacher
59 Data_Susanne_WGS_3amplicons

/media/jhuang/Titisee

# Name
1 Data_Anna4_SNP
2 Data_Anna5_SNP_rsync_error
3 TRASH
4 Data_Nicole6_HEV_4_SNP_calling_PE_DEL
5 Data_Nina1_Nicole7
6 Data_Nicole6_HEV_4_SNP_calling_SE_DEL
7 180119_M03701_0115_000000000-BFG46.zip
8 Data_Nicole10_16S_interlab_PUBLISHED
9 Anna11_assemblies
10 Anna11_trees
11 Data_Nicole6_HEV_new_orig_fastqs
12 Data_Anna9_OXA-48_or_OXA-181
13 bengal_results_v1_2018
14 DO.pdf
15 damian_DEL
16 MAGpy_db
17 UGENE_v1_32_data_cistrome
18 UGENE_v1_32_data_ngs_classification
19 Data_Nicole6_HEV_Swantje
20 Data_Nico_Gagliani
21 GAMOLA2_prototyp
22 Thomas_methylation_EPIC_DO
23 Data_Nicola_Schaltenberg
24 Data_Nicola_Schaltenberg_PICRUSt
25 HOME_FREIBURG
26 Data_Francesco_16S
27 3rd_party
28 ConsPred_prokaryotic_genome_annotation
29 'System Volume Information'
30 damian_v201016
31 Data_Holger_VRE
32 Data_Holger_Pseudomonas_aeruginosa_SNP
33 Eigene_Ordner_HR
34 GAMOLA2
35 Data_Anastasia_RNASeq
36 Data_Amir_PUBLISHED
37 Data_Marc_RNA-seq_Sepidermidis
38 '$RECYCLE.BIN'
39 Data_Xiaobo_10x_3
40 Data_Tam_DNAseq_2023_Comparative_ATCC19606_AYE_ATCC17978
41 Data_Holger_S.epidermidis_short
42 TEMP
43 Data_Holger_S.epidermidis_long
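
Several names recur across drives; for example, Data_Anna4_SNP appears under both ~/DATA_Intenso and this drive, and Data_Holger_VRE under both ~/DATA_C and this drive. A sketch for spotting name collisions between two mounts, assuming the paths above:

# Print top-level names present on both mounts (deduplication candidates).
comm -12 <(ls -1 ~/DATA_Intenso | sort) <(ls -1 /media/jhuang/Titisee | sort)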

/media/jhuang/Elements(Denise_ChIPseq)

# Name
1 Data_Denise_LTtrunc_H3K27me3_2_results_DEL
2 Data_Denise_LTtrunc_H3K4me3_2_results_DEL
3 Data_Anna12_HAPDICS_final_not_finished_DEL
4 m_aepfelbacher_DEL
5 Data_Damian
6 ST772_DEL
7 ALL_trimmed_part_DEL
8 Data_Denise_ChIPSeq_Protocol1
9 Data_Pietschmann_HCV_Amplicon
10 Data_Nicole6_HEV_ownMethod_new
11 HD04-1.fasta
12 RNAHiSwitch_
13 RNAHiSwitch__
14 RNAHiSwitch___
15 RNAHiSwitch_paper_
16 RNAHiSwitch_milestone1_DELETED
17 RNAHiSwitch_paper.tar.gz
18 RNAHiSwitch_paper_DELETED
19 RNAHiSwitch_milestone1
20 RNAHiSwitch_paper
21 Ute_RNASeq_results
22 Ute_miRNA_results_38
23 RNAHiSwitch
24 Data_HepE_Freiburg_PUBLISHED
25 Data_INTENSO_2022-06
26 '$RECYCLE.BIN'
27 'System Volume Information'
28 Data_Anna_Mixta_hanseatica_PUBLISHED
29 coi_disclosure.docx
30 Data_Jingang
31 Data_Arck_16S_MMc_PUBLISHED
32 Data_Laura_ChIPseq_GSE120945
33 Data_Nicole6_HEV_ownMethod
34 Data_Susanne_16S_re_UNPUBLISHED *
35 Data_Denise_ChIPSeq_Protocol2
36 Data_Caroline_RNAseq_wt_timecourse
37 Data_Caroline_RNAseq_brain_organoids
38 Data_Amir_PUBLISHED_DEL
39 Data_download_virus_fam
40 Data_Gunnar_Yersiniomics_COPYFAILED_DEL
41 Data_Paul_and_Marc_Epidome_batch3
42 ifconfig_hamm.txt
43 Data_Soeren_2023_PUBLISHING
44 Data_Birthe_Svenja_RSV_Probe3_PUBLISHING
45 Data_Ute
46 Data_Susanne_16S_UNPUBLISHED *

/media/jhuang/Seagate Expansion Drive(HOffice)

# Name
1 SeagateExpansion.ico
2 Autorun.inf
3 Start_Here_Win.exe
4 Warranty.pdf
5 Start_Here_Mac.app
6 Seagate
7 HomeOffice_DIR (Data_Anna_HAPDICS_RNASeq, From_Samsung_T5)
8 DATA_COPY_FROM_178528 (copy_and_clean.sh, logfile_jhuang.log, jhuang)
9 'System Volume Information'
10 '$RECYCLE.BIN'

/media/jhuang/Elements(Anna_C.acnes)

# Name
1 Data_Swantje_HEV_using_viral-ngs
2 VIPER_static_DEL
3 Data_Nicole6_HEV_Swantje1_blood
4 Data_Nicole6_HEV_benchmark
5 Data_Denise_RNASeq_GSE79958
6 Data_16S_Leonie_from_Nico_Gaglianis
7 Fastqs_19-21
8 'System Volume Information'
9 Data_Luise_Epidome_test
10 Data_Anna_C.acnes_PUBLISHED
11 Data_Denise_LT_DNA_Bindung
12 Data_Denise_LT_K331A_RNASeq
13 Data_Luise_Epidome_batch1
14 Data_Luise_Pseudomonas_aeruginosa_PUBLISHED
15 Data_Luise_Epidome_batch2
16 picrust2_out_2024_2
17 '$RECYCLE.BIN'

/media/jhuang/Seagate Expansion Drive(DATA_COPY_FROM_hamburg)

# Name
1 Autorun.inf
2 Start_Here_Win.exe
3 Warranty.pdf
4 Start_Here_Mac.app
5 Seagate
6 DATA_COPY_TRANSFER_INCOMPLETE_DEL
7 DATA_COPY_FROM_hamburg

/media/jhuang/Seagate Expansion Drive(Seagate_1)

# Name
1 RNA_seq_analysis_tools_2013
2 Data_Laura0
3 Data_Petra_Arck
4 Data_Martin_mycoplasma
5 chromhmm-enhancers
6 ChromHMM_Dir
7 Data_Denise_sT_H3K4me3
8 Data_Denise_sT_H3K27me3
9 Start_Here_Mac.app
10 Seagate
11 Data_Nicole16_parapoxvirus
12 Project_h_rohde_Susanne_WGS_unbiased_DEL.zip
13 Data_Denise_ChIPSeq_Protocol1
14 Data_ENNGS_pathogen_detection_pipeline_comparison
15 j_huang_201904_202002
16 Data_Laura_ChIPseq_GSE120945
17 batch_200314_incomplete
18 m_aepfelbacher.zip
19 m_error_DEL
20 batch_200325
21 batch_200319
22 GAMOLA2_prototyp
23 Data_Nicola_Gagliani
24 2017-18_raw_data
25 Data_Arck_MeDIP
26 trimmed
27 Data_Nicole_16S_Christmas_2020_2
28 j_huang_202007_202012
29 Data_Nicole_16S_Christmas_2020
30 Downloads_2021-01-18_DEL
31 Data_Laura_plasmid
32 Data_Laura_16S_2_re
33 Data_Laura_16S_2
34 Data_Laura_16S_2re
35 Data_Laura_16S_merged
36 Downloads_DEL
37 Data_Laura_16S
38 Data_Anna12_HAPDICS_final
39 '$RECYCLE.BIN'
40 'System Volume Information'

/media/jhuang/Seagate Expansion Drive(Seagate_2)

# Name
1 Data_Nicole4_TH17
2 Start_Here_Win.exe
3 Autorun.inf
4 Warranty.pdf
5 Start_Here_Mac.app
6 Seagate
7 Data_Denise_RNASeq_trimmed_DEL
8 HD12
9 Qi_panGenome
10 ALL
11 fastq_HPI_bw_2019_08_and_2020_02
12 f1_R1_link.sh
13 f1_R2_link.sh
14 rtpd_files
15 m_aepfelbacher.zip
16 Data_Nicole_16S_Hamburg_Odense_Cornell_Muenster
17 HyAsP_incomplete_genomes
18 HyAsP_normal_sampled_input
19 HyAsP_complete_genomes
20 video.zip
21 sam2bedgff.pl
22 HD04.infection.hS_vs_HD04.nose.hS_annotated_degenes.xls
23 ALL83
24 Data_Pietschmann_RSV_Probe_PUBLISHED
25 HyAsP_normal
26 Data_Manthey_16S
27 rtpd_files_DEL
28 HyAsP_bold
29 Data_HEV
30 Seq_VRE_hybridassembly
31 Data_Anna12_HAPDICS_raw_data_shovill_prokka
32 Data_Anna_HAPDICS_WGS_ALL
33 Data_HEV_Freiburg_2020
34 Data_Nicole_HDV_Recombination_PUBLISHED
35 s_hero2x
36 201030_M03701_0207_000000000-J57B4.zip
37 README
38 'README(1)'
39 dna2.fasta.fai
40 91.pep
41 91.orf
42 91.orf.fai
43 dgaston-dec-06-2012-121211124858-phpapp01.pdf
44 tileshop.fcgi
45 ppat.1009304.s016.tif
46 sequence.txt
47 'sequence(1).txt'
48 GSE128169_series_matrix.txt.gz
49 GSE128169_family.soft.gz
50 Data_Anna_HAPDICS_RNASeq
51 Data_Christopher_MeDIP_MMc_PUBLISHED
52 Data_Gunnar_Yersiniomics_IMCOMPLETE_DEL
53 Data_Denise_RNASeq
54 'System Volume Information'
55 '$RECYCLE.BIN'

/media/jhuang/Elements(An14_RNAs)

# Name
1 Data_Anna10_RP62A
2 Data_Nicole12_16S_Kluwe_Bunders
3 chromhmm-enhancers
4 Data_Denise_sT_Methylation
5 Data_Denise_LTtrunc_Methylation
6 Data_16S_arckNov
7 Data_Tabea_RNASeq
8 nr_gz_README
9 j_huang_raw_fq
10 'System Volume Information'
11 '$RECYCLE.BIN'
12 host_refs
13 Vraw
14 Data_Susanne_Amplicon_RdRp_orf1_2 *
15 tmp
16 Data_RNA188_Paul_Becher
17 Data_ChIPSeq_Laura
18 Data_16S_arckNov_review_PUBLISHED
19 Data_16S_arckNov_re
20 Fastqs
21 Data_Tabea_RNASeq_submission
22 Data_Anna_Cutibacterium_acnes_DEL
23 Data_Silvia_RNASeq_SUBMISSION
24 Data_Hannes_ChIPSeq
25 Data_Anna14_RNASeq_to_be_DEL
26 Data_Pietschmann_RSV_Probe2_PUBLISHED
27 Data_Holger_Klebsiella_pneumoniae_SNP_PUBLISHING
28 Data_Anna14_RNASeq_plus_public

/media/jhuang/Elements(Indra_HAPDICS)

# Name
1 Data_Anna11_Sepdermidis_DEL
2 HD15_without_10
3 HD31
4 HD33
5 HD39
6 HD43
7 HD46
8 HD15_with_10
9 HD26
10 HD59
11 HD25
12 HD21
13 HD17
14 HD04
15 Data_Anna11_Pair1-6_P6
16 Data_Anna12_HAPDICS_HyAsP
17 HAPDICS_hyasp_plasmids
18 Data_Anna_HAPDICS_review
19 data_overview.txt
20 align_assem_res_DEL
21 'System Volume Information'
22 EXCHANGE_DEL
23 Data_Indra_H3K4me3_public
24 Data_Gunnar_MS
25 '$RECYCLE.BIN'
26 UKE_DELLWorkstation_C_Users_indbe_Desktop
27 Linux_DELLWorkstation_C_Users_indbe_VirtualBoxVMs
28 Data_Anna_HAPDICS_RNASeq_rawdata
29 Data_Indra_H3K27ac_public
30 Data_Holger_Klebsiella_pneumoniae_SNP_PUBLISHING
31 DATA_INDRA_RNASEQ
32 DATA_INDRA_CHIPSEQ

/media/jhuang/Elements(jhuang_*)

# Name
1 'Install Western Digital Software for Windows.exe'
2 'Install Western Digital Software for Mac.dmg'
3 'System Volume Information'
4 '$RECYCLE.BIN'
5 20250203_FS10003086_95_BTR67811-0621

/media/jhuang/Smarty

# Name
1 lost+found
2 Blast_db
3 temporary_files_DEL
4 ALIGN_ASSEM
5 Data_Paul_Staphylococcus_epidermidis
6 Data_16S_Degenhardt_Marius_DEL
7 Data_Gunnar_Yersiniomics_DEL
8 Data_Manja_RNAseq_Organoids_Virus
9 Data_Emilia_MeDIP
10 DjangoApp_Backup_2023-10-30
11 ref
12 Data_Michelle_RNAseq_2025_raw_data_DEL_AFTER_UPLOAD_GEO
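
Before anything flagged above is actually removed (e.g. Data_Michelle_RNAseq_2025_raw_data_DEL_AFTER_UPLOAD_GEO, which should only go once the GEO upload is confirmed), it helps to know what each entry costs in space. A sketch, assuming the mount path above:

# Human-readable size of every top-level entry, largest last.
du -sh /media/jhuang/Smarty/* 2>/dev/null | sort -h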

Original input (raw listings, kept verbatim)

/media/jhuang/INTENSO is empty --> the data are now in ~/DATA_Intenso
/dev/sdg1       3,7T  512K  3,7T   1% /media/jhuang/INTENSO

jhuang@WS-2290C:~/DATA$ ls -tlrh
total 1,6M
drwxrwxrwx   6 jhuang jhuang 4,0K Okt 26  2022 Data_Ute_MKL1
drwxrwxrwx   8 jhuang jhuang 4,0K Jan 13  2023 Data_Ute_RNA_4_2022-11_test
drwxrwxr-x   7 jhuang jhuang 4,0K Mär  8  2023 Data_Ute_RNA_3
drwxr-xr-x  11 jhuang jhuang 4,0K Dez 19  2023 Data_Susanne_Carotis_RNASeq_PUBLISHING
drwxr-xr-x  21 jhuang jhuang 4,0K Jun 18  2024 Data_Jiline_Yersinia_SNP
drwxrwxr-x   5 jhuang jhuang 4,0K Jul 22  2024 Data_Tam_ABAYE_RS05070_on_A_calcoaceticus_baumannii_complex_DUPLICATED_DEL
drwxr-xr-x   2 jhuang jhuang 4,0K Jul 23  2024 Data_Nicole_CRC1648
drwxr-xr-x   4 jhuang jhuang 4,0K Sep  6  2024 Mouse_HS3ST1_12373_out
drwxr-xr-x   4 jhuang jhuang 4,0K Sep  6  2024 Mouse_HS3ST1_12175_out
drwxrwxr-x  10 jhuang jhuang 4,0K Sep 12  2024 Data_Biobakery
drwxrwxr-x   6 jhuang jhuang 4,0K Sep 23  2024 Data_Xiaobo_10x_2
drwxr-xr-x   4 jhuang jhuang 4,0K Sep 23  2024 Data_Xiaobo_10x_3
drwxr-xr-x   3 jhuang jhuang 4,0K Sep 26  2024 Talk_Nicole_CRC1648
drwxr-xr-x   2 jhuang jhuang 4,0K Sep 26  2024 Talks_Bioinformatics_Meeting
drwxr-xr-x   2 jhuang jhuang  12K Sep 26  2024 Talks_resources
drwxrwxr-x   6 jhuang jhuang  12K Okt 10  2024 Data_Susanne_MPox_DAMIAN
drwxrwxr-x   3 jhuang jhuang 4,0K Okt 14  2024 Data_host_transcriptional_response
drwxr-xr-x  13 jhuang jhuang 4,0K Okt 23  2024 Talks_including_DEEP-DV
drwxrwxr-x   2 jhuang jhuang 4,0K Okt 24  2024 DOKTORARBEIT
drwxrwxr-x  18 jhuang jhuang 4,0K Nov 11  2024 Data_Susanne_MPox
drwxrwxr-x  25 jhuang jhuang  12K Nov 11  2024 Data_Jiline_Transposon
drwxrwxr-x  16 jhuang jhuang  20K Nov 25  2024 Data_Jiline_Transposon2
drwxrwxr-x   3 jhuang jhuang 4,0K Dez 13  2024 Data_Matlab
drwxrwxr-x   5 jhuang jhuang 4,0K Jan 28  2025 deepseek-ai
drwx------   4 jhuang jhuang 4,0K Feb  5  2025 Stick_Mi_DEL
-rw-rw-r--   1 jhuang jhuang 1,1K Feb 18  2025 TODO_shares
drwxrwxrwx  13 jhuang jhuang 4,0K Mär  3  2025 Data_Ute_RNA_4
drwxrwxr-x   2 jhuang jhuang 4,0K Mär 31  2025 Data_Liu_PCA_plot
-rw-rw-r--   1 jhuang jhuang  43K Apr  3  2025 README_run_viral-ngs_inside_Docker
-rw-rw-r--   1 jhuang jhuang 8,7K Apr  9  2025 README_compare_genomes
-rw-rw-r--   1 jhuang jhuang    0 Apr 11  2025 mapped.bam
drwxrwxr-x   3 jhuang jhuang 4,0K Apr 24  2025 Data_Serpapi
drwxrwxrwx  22 jhuang jhuang 4,0K Apr 30  2025 Data_Ute_RNA_1_2
drwxrwxr-x  15 jhuang jhuang 4,0K Apr 30  2025 Data_Marc_RNAseq_2024
drwxrwxr-x  45 jhuang jhuang  12K Mai 15  2025 Data_Nicole_CaptureProbeSequencing
-rw-rw-r--   1 jhuang jhuang  657 Mai 23  2025 LOG_mapping
drwxrwxr-x  46 jhuang jhuang 4,0K Mai 26  2025 Data_Huang_Human_herpesvirus_3
drwxrwxr-x   8 jhuang jhuang 4,0K Jun 13  2025 Data_Nicole_DAMIAN_Post-processing_Pathoprobe_FluB_Links
lrwxrwxrwx   1 jhuang jhuang   37 Jun 16  2025 Access_to_Win7 -> ./Data_Marius_16S/picrust2_out_2024_2
drwxrwxr-x  17 jhuang jhuang 4,0K Jun 18  2025 Data_DAMIAN_Post-processing_Flavivirus_and_FSME_and_Haemophilus
drwxr-xr-x  42 jhuang jhuang  36K Jun 23  2025 Data_Luise_Sepi_STKN
drwxrwxr-x  29 jhuang jhuang  20K Jul 22  2025 Data_Patricia_Sepi_7samples
drwxr-xr-x   9 jhuang jhuang 4,0K Aug  8  2025 Data_Soeren_2025_PUBLISHING
drwxrwxr-x   9 jhuang jhuang 4,0K Aug 13  2025 Data_Ben_RNAseq_2025
drwxrwxr-x  34 jhuang jhuang  12K Sep  3 12:18 Data_Tam_DNAseq_2025_AYE-WT_Q_S_craA-Tig4_craA-1-Cm200_craA-2-Cm200
drwxrwxr-x  50 jhuang jhuang  16K Okt  6 17:59 Data_Patricia_Transposon
drwxrwxr-x  23 jhuang jhuang  12K Okt 20 13:27 Data_Patricia_Transposon_2025
drwxrwxr-x   2 jhuang jhuang 4,0K Okt 23 12:21 Colocation_Space
drwxrwxr-x   2 jhuang jhuang 4,0K Okt 27 12:56 Data_Tam_Methylation_2025_empty
-rw-rw-r--   1 jhuang jhuang 151K Nov  3 13:01 2025-11-03_eVB-Schreiben_12-57.pdf
-rw-rw-r--   1 jhuang jhuang  67K Nov  5 16:59 DEGs_Group1_A1-A3+A8-A10_vs_Group2_B10-B16.png
-rw-rw-r--   1 jhuang jhuang 687K Nov 14 09:55 README.pdf
drwxrwxr-x   2 jhuang jhuang 4,0K Nov 24 15:43 Data_Hannes_JCM00612
drwxrwxr-x   3 jhuang jhuang 4,0K Dez  4 17:03 167_redundant_DEL
drwxrwxr-x   2 jhuang jhuang 4,0K Dez  8 10:33 Lehre_Bioinformatik
drwxrwxr-x  27 jhuang jhuang  12K Dez  8 11:29 Data_Ben_Boruta_Analysis
drwxrwxr-x  18 jhuang jhuang 4,0K Dez  8 17:39 Data_Childrensclinic_16S_2025_DEL
drwxrwxr-x   2 jhuang jhuang 4,0K Dez 10 10:05 Data_Ben_Mycobacterium_pseudoscrofulaceum
-rw-rw-r--   1 jhuang jhuang 8,9K Dez 15 12:42 Foong_RNA_mSystems_Huang_Changed.txt
drwxrwxr-x  22 jhuang jhuang 4,0K Dez 17 13:07 Data_Pietro_Scatturo_and_Charlotte_Uetrecht_16S_2025
drwxrwxr-x   8 jhuang jhuang 4,0K Dez 18 10:45 Data_JuliaBerger_RNASeq_SARS-CoV-2
drwxrwxr-x  19 jhuang jhuang 4,0K Jan  3 17:42 Data_PaulBongarts_S.epidermidis_HDRNA
lrwxrwxrwx   1 jhuang jhuang   31 Jan 12 14:30 Data_Ute -> /media/jhuang/Elements/Data_Ute
drwxrwxr-x  12 jhuang jhuang 4,0K Jan 16 12:44 Data_Foong_DNAseq_2025_AYE_Dark_vs_Light_TODO
drwxrwxrwx  22 jhuang jhuang 4,0K Jan 16 12:48 Data_Foong_RNAseq_2021_ATCC19606_Cm
drwxrwxr-x   2 jhuang jhuang 4,0K Jan 16 13:02 Data_Tam_Funding
drwxrwxr-x   9 jhuang jhuang 4,0K Jan 16 13:32 Data_Tam_RNAseq_2025_LB-AB_IJ_W1_Y1_WT_vs_Mac-AB_IJ_W1_Y1_WT_on_ATCC19606
drwxrwxr-x  12 jhuang jhuang 4,0K Jan 16 13:32 Data_Tam_RNAseq_2025_subMIC_exposure_on_ATCC19606
-rw-rw-r--   1 jhuang jhuang 1,2K Jan 16 13:34 Data_Tam.txt
drwxrwxr-x  16 jhuang jhuang 4,0K Jan 16 13:37 Data_Tam_RNAseq_2024_AUM_MHB_Urine_on_ATCC19606
drwxrwxr-x  10 jhuang jhuang 4,0K Jan 16 18:22 Data_Tam_Metagenomics_2026
drwxrwxr-x   6 jhuang jhuang  16K Jan 23 16:35 Data_Michelle
drwxrwxr-x  38 jhuang jhuang  12K Jan 28 15:20 Data_Nicole_16S_2025_Childrensclinic
drwxr-xr-x 145 jhuang jhuang  36K Jan 29 10:49 Data_Sophie_HDV_Sequences
drwxrwxr-x   4 jhuang jhuang 4,0K Jan 30 11:44 Data_Tam_DNAseq_2026_19606deltaIJfluE
-rw-rw-r--   1 jhuang jhuang  63K Jan 30 17:53 README_nf-core
drwxrwxr-x  22 jhuang jhuang 4,0K Feb  4 10:43 Data_Vero_Kymographs
drwxrwxr-x  13 jhuang jhuang 4,0K Feb  4 14:06 Access_to_Win10
drwxrwxr-x   7 jhuang jhuang 4,0K Feb  5 11:59 Data_Patricia_AMRFinderPlus_2025
drwxrwxr-x  45 jhuang jhuang 4,0K Feb  6 11:54 Data_Tam_DNAseq_2025_Unknown-adeABadeIJ_adeIJK_CM1_CM2
drwxrwxr-x  41 jhuang jhuang  12K Feb  9 15:11 Data_Damian
drwxrwxr-x   6 jhuang jhuang 4,0K Feb 13 12:48 Data_Karoline_16S
drwxrwxr-x  13 jhuang jhuang  12K Feb 13 18:09 Data_JuliaFuchs_RNAseq_2025
drwxrwxr-x  18 jhuang jhuang 4,0K Feb 16 11:19 Data_Tam_DNAseq_2025_ATCC19606-Y1Y2Y3Y4W1W2W3W4_TODO
drwxrwxr-x  34 jhuang jhuang 4,0K Feb 16 15:54 Data_Tam_DNAseq_2026_Acinetobacter_harbinensis
drwxrwxr-x  19 jhuang jhuang 4,0K Feb 16 17:13 Data_Benjamin_DNAseq_2026_GE11174
drwxrwxrwx  36 jhuang jhuang  12K Feb 17 15:02 Data_Susanne_spatialRNA_2022.9.1_backup
drwxrwxr-x  39 jhuang jhuang  12K Feb 17 15:12 Data_Susanne_spatialRNA

jhuang@WS-2290C:~/DATA_A$ ls -ltrh
total 24K
drwxr-xr-x  7 jhuang jhuang 4,0K Jun 18  2024 Data_Damian_NEW_CREATED
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024 Data_R_bubbleplots
drwxr-xr-x 16 jhuang jhuang 4,0K Jun 18  2024 Data_Ute_TRANSFERED_DEL
drwxr-xr-x  2 jhuang jhuang 4,0K Okt  7  2024 Paper_Target_capture_sequencing_MHH_PUBLISHED
drwxr-xr-x 20 jhuang jhuang 4,0K Okt  8  2024 Data_Nicole8_Lamprecht_new_PUBLISHED
drwxrwxrwx  8 jhuang jhuang 4,0K Mai 21  2025 Data_Samira_RNAseq

jhuang@WS-2290C:~/DATA_B$ ls -tlrh
total 136K
drwxr-xr-x  3 jhuang jhuang 4,0K Jun 18  2024 Data_DAMIAN_endocarditis_encephalitis
drwxr-xr-x  8 jhuang jhuang 4,0K Jun 18  2024 Data_Denise_sT_PUBLISHING
drwxr-xr-x 12 jhuang jhuang 4,0K Jun 18  2024 Data_Fran2_16S_func
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024 Data_Holger_5179-R1_vs_5179
drwxr-xr-x 16 jhuang jhuang 4,0K Jun 18  2024 Antraege_
drwxr-xr-x 18 jhuang jhuang 4,0K Jun 18  2024 Data_16S_Nicole_210222
drwxr-xr-x  6 jhuang jhuang 4,0K Jun 18  2024 Data_Adam_Influenza_A_virus
drwxr-xr-x 14 jhuang jhuang  12K Jun 18  2024 Data_Anna_Efaecium_assembly
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024 Data_Bactopia
drwxr-xr-x  5 jhuang jhuang 4,0K Jun 18  2024 Data_Ben_RNAseq
drwxr-xr-x  7 jhuang jhuang 4,0K Jun 18  2024 Data_Johannes_PIV3
drwxr-xr-x 19 jhuang jhuang 4,0K Jun 18  2024 Data_Luise_Epidome_longitudinal_nose
drwxr-xr-x  6 jhuang jhuang 4,0K Jun 18  2024 Data_Manja_Hannes_Probedesign
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024 Data_Marc_AD_PUBLISHING
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024 Data_Marc_RNA-seq_Saureus_Review
drwxr-xr-x 17 jhuang jhuang 4,0K Jun 18  2024 Data_Nicole_16S
drwxr-xr-x  3 jhuang jhuang 4,0K Jun 18  2024 Data_Nicole_cfDNA_pathogens
drwxr-xr-x 16 jhuang jhuang 4,0K Jun 18  2024 Data_Ring_and_CSF_PegivirusC_DAMIAN
drwxr-xr-x  4 jhuang jhuang 4,0K Jun 18  2024 Data_Song_Microarray
drwxr-xr-x 11 jhuang jhuang 4,0K Jun 18  2024 Data_Susanne_Omnikron
drwxr-xr-x  3 jhuang jhuang 4,0K Jun 18  2024 Data_Viro
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024 Doktorarbeit
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024 Poster_Rohde_20230724
drwxr-xr-x  6 jhuang jhuang 4,0K Jul 12  2024 Data_Django
drwxr-xr-x 35 jhuang jhuang 4,0K Okt 21  2024 Data_Holger_S.epidermidis_1585_5179_HD05
drwxr-xr-x  9 jhuang jhuang 4,0K Nov 18  2024 Data_Manja_RNAseq_Organoids_Virus
drwxr-xr-x  2 jhuang jhuang 4,0K Feb 21  2025 Data_Holger_MT880870_MT880872_Annotation
drwxr-xr-x 12 jhuang jhuang 4,0K Apr  8  2025 Data_Soeren_RNA-seq_2022
drwxr-xr-x  5 jhuang jhuang 4,0K Apr 11  2025 Data_Manja_RNAseq_Organoids_Merged
drwxr-xr-x 24 jhuang jhuang 4,0K Apr 25  2025 Data_Gunnar_Yersiniomics
drwxr-xr-x 10 jhuang jhuang 4,0K Jan 16 17:14 Data_Manja_RNAseq_Organoids
drwxr-xr-x  3 jhuang jhuang 4,0K Feb 17 12:11 Data_Susanne_Carotis_MS

jhuang@WS-2290C:~/DATA_C$ ls -tlrh
total 13G
-rwxr-xr-x  1 jhuang jhuang 1,7M Jun 18  2024  2022-10-27_IRI_manuscript_v03_JH.docx
-rwxr-xr-x  1 jhuang jhuang 7,1K Jun 18  2024  16304905.fasta
-rwxr-xr-x  1 jhuang jhuang  55K Jun 18  2024 '16S data manuscript_NF.docx'
-rwxr-xr-x  1 jhuang jhuang 792K Jun 18  2024  180820_2_supp_4265595_sw6zjk.docx
-rwxr-xr-x  1 jhuang jhuang  17K Jun 18  2024  180820_2_supp_4265596_sw6zjk.docx
-rwxr-xr-x  1 jhuang jhuang  12K Jun 18  2024  1a_vs_3.csv
-rwxr-xr-x  1 jhuang jhuang  90K Jun 18  2024 '2.05.01.05-A01 Urlaubsantrag-Shuting-beantragt.pdf'
-rwxr-xr-x  1 jhuang jhuang 708K Jun 18  2024  2014SawickaBBA.pdf
-rwxr-xr-x  1 jhuang jhuang  61K Jun 18  2024  20160509Manuscript_NDM_OXA_mitKomm.doc
-rwxr-xr-x  1 jhuang jhuang 289K Jun 18  2024  220607_Agenda_monthly_meeting.pdf
-rwxr-xr-x  1 jhuang jhuang  14K Jun 18  2024 '20221129 Table mutations.docx'
-rwxr-xr-x  1 jhuang jhuang  12G Jun 18  2024  230602_NB501882_0428_AHKG53BGXT.zip
-rwxr-xr-x  1 jhuang jhuang 107K Jun 18  2024  362383173.rar
-rwxr-xr-x  1 jhuang jhuang 128K Jun 18  2024  562.9459.1.fa
-rwxr-xr-x  1 jhuang jhuang 126K Jun 18  2024  562.9459.1_rc.fa
-rwxr-xr-x  1 jhuang jhuang 1,6M Jun 18  2024  ASA3P.pdf
-rwxr-xr-x  1 jhuang jhuang  21K Jun 18  2024  All_indels_annotated_vHR.xlsx
-rwxr-xr-x  1 jhuang jhuang  11K Jun 18  2024 'Amplikon_indeces_Susanne +groups.xlsx'
-rwxr-xr-x  1 jhuang jhuang 9,6K Jun 18  2024  Amplikon_indeces_Susanne.xlsx
-rwxr-xr-x  1 jhuang jhuang   68 Jun 18  2024  GAMOLA2
-rwxr-xr-x  1 jhuang jhuang   88 Jun 18  2024  Data_Susanne_Carotis_spatialRNA_PUBLISHING
-rwxr-xr-x  1 jhuang jhuang  112 Jun 18  2024  Data_Paul_Staphylococcus_epidermidis
-rwxr-xr-x  1 jhuang jhuang  118 Jun 18  2024  Data_Nicola_Schaltenberg_PICRUSt
-rwxr-xr-x  1 jhuang jhuang  100 Jun 18  2024  Data_Nicola_Schaltenberg
-rwxr-xr-x  1 jhuang jhuang   94 Jun 18  2024  Data_Nicola_Gagliani
-rwxr-xr-x  1 jhuang jhuang   96 Jun 18  2024  Data_methylome_MMc
-rwxr-xr-x  1 jhuang jhuang   78 Jun 18  2024  Data_Jingang
-rwxr-xr-x  1 jhuang jhuang  112 Jun 18  2024  Data_Indra_RNASeq_GSM2262901
-rwxr-xr-x  1 jhuang jhuang   84 Jun 18  2024  Data_Holger_VRE
-rwxr-xr-x  1 jhuang jhuang  128 Jun 18  2024  Data_Holger_Pseudomonas_aeruginosa_SNP
-rwxr-xr-x  1 jhuang jhuang   92 Jun 18  2024  Data_Hannes_ChIPSeq
-rwxr-xr-x  1 jhuang jhuang   76 Jun 18  2024  Data_Emilia_MeDIP
-rwxr-xr-x  1 jhuang jhuang   88 Jun 18  2024  Data_ChristophFR_HepE_published
-rwxr-xr-x  1 jhuang jhuang  158 Jun 18  2024  Data_Christopher_MeDIP_MMc_published
-rwxr-xr-x  1 jhuang jhuang  104 Jun 18  2024  Data_Anna_Kieler_Sepi_Staemme
-rwxr-xr-x  1 jhuang jhuang  136 Jun 18  2024  Data_Anna12_HAPDICS_final
-rwxr-xr-x  1 jhuang jhuang   96 Jun 18  2024  Data_Anastasia_RNASeq_PUBLISHING
-rwxr-xr-x  1 jhuang jhuang 169K Jun 18  2024  Aufnahmeantrag_komplett_10_2022.pdf
-rwxr-xr-x  1 jhuang jhuang 1,2M Jun 18  2024  Astrovirus.pdf
-rwxr-xr-x  1 jhuang jhuang  732 Jun 18  2024  COMMANDS
-rwxr-xr-x  1 jhuang jhuang  690 Jun 18  2024  Bacterial_pipelines.txt
-rwxr-xr-x  1 jhuang jhuang  16M Jun 18  2024  COMPSRA_uke_DEL.jar
-rwxr-xr-x  1 jhuang jhuang 239K Jun 18  2024  ChIPSeq_pipeline_desc.docx
-rwxr-xr-x  1 jhuang jhuang 385K Jun 18  2024  ChIPSeq_pipeline_desc.pdf
-rwxr-xr-x  1 jhuang jhuang 2,1M Jun 18  2024  Comparative_genomic_analysis_of_eight_novel_haloal.pdf
-rwxr-xr-x  1 jhuang jhuang  64K Jun 18  2024  CvO_Klassenliste_7_3.pdf
-rwxr-xr-x  1 jhuang jhuang 649K Jun 18  2024 'Copy of pool_b1_CGATGT_300.xlsx'
-rwxr-xr-x  1 jhuang jhuang 3,9K Jun 18  2024  Fran_16S_Exp8-17-21-27.txt
-rwxr-xr-x  1 jhuang jhuang  463 Jun 18  2024  HPI_DRIVE
-rwxr-xr-x  1 jhuang jhuang 179K Jun 18  2024  HEV_aligned.fasta
-rwxr-xr-x  1 jhuang jhuang 4,1K Jun 18  2024  INTENSO_DIR
-rwxr-xr-x  1 jhuang jhuang  14K Jun 18  2024  HPI_samples_for_NGS_29.09.22.xlsx
-rwxr-xr-x  1 jhuang jhuang 4,3K Jun 18  2024  Hotmail_to_Gmail
-rwxr-xr-x  1 jhuang jhuang  13M Jun 18  2024  Indra_Thesis_161020.pdf
-rwxr-xr-x  1 jhuang jhuang 5,2K Jun 18  2024 'LT K331A.gbk'
-rwxr-xr-x  1 jhuang jhuang    0 Jun 18  2024  LOG_p954_stat
-rwxr-xr-x  1 jhuang jhuang 684K Jun 18  2024  LOG
-rwxr-xr-x  1 jhuang jhuang 197K Jun 18  2024  Manuscript_10_02_2021.docx
-rwxr-xr-x  1 jhuang jhuang 595K Jun 18  2024  Metagenomics_Tools_and_Insights.pdf
-rwxr-xr-x  1 jhuang jhuang  14K Jun 18  2024 'Miseq Amplikon LAuf April.xlsx'
-rwxr-xr-x  1 jhuang jhuang 2,2M Jun 18  2024  NGS.tar.gz
-rwxr-xr-x  1 jhuang jhuang 586K Jun 18  2024  Nachweis_Bakterien_Viren_im_Hochdurchsatz.pdf
-rwxr-xr-x  1 jhuang jhuang 1,2K Jun 18  2024  Nicole8_Lamprecht_logs
-rwxr-xr-x  1 jhuang jhuang  24M Jun 18  2024  Nanopore.handouts.pdf
-rwxr-xr-x  1 jhuang jhuang 113K Jun 18  2024 'Norovirus paper Susanne 191105.docx'
-rwxr-xr-x  1 jhuang jhuang 503K Jun 18  2024  PhyloRNAalifold.pdf
-rwxr-xr-x  1 jhuang jhuang  19K Jun 18  2024  README_R
-rwxr-xr-x  1 jhuang jhuang 137K Jun 18  2024  README_RNAHiSwitch_DEL
-rwxr-xr-x  1 jhuang jhuang 8,3M Jun 18  2024  RNA-NGS_Analysis_modul3_NanoStringNorm.zip
-rwxr-xr-x  1 jhuang jhuang  57K Jun 18  2024  RNAConSLOptV1.2.tar.gz
-rwxr-xr-x  1 jhuang jhuang  17K Jun 18  2024 'RSV GFP5 including 3`UTR.docx'
-rwxr-xr-x  1 jhuang jhuang  238 Jun 18  2024  SNPs_on_pangenome.txt
-rwxr-xr-x  1 jhuang jhuang   55 Jun 18  2024  SERVER
-rwxr-xr-x  1 jhuang jhuang  26M Jun 18  2024  R_tutorials-master.zip
-rwxr-xr-x  1 jhuang jhuang 182K Jun 18  2024  Rawdata_Readme.pdf
-rwxr-xr-x  1 jhuang jhuang  40K Jun 18  2024  SUB10826945_record_preview.txt
-rwxr-xr-x  1 jhuang jhuang 283K Jun 18  2024  S_staphylococcus_annotated_diff_expr.xls
-rwxr-xr-x  1 jhuang jhuang 2,0K Jun 18  2024  Snakefile_list
-rwxr-xr-x  1 jhuang jhuang 160K Jun 18  2024  Source_Classification_Code.rds
-rwxr-xr-x  1 jhuang jhuang  61K Jun 18  2024  Supplementary_Table_S3.xlsx
-rwxr-xr-x  1 jhuang jhuang  617 Jun 18  2024  Untitled.ipynb
-rwxr-xr-x  1 jhuang jhuang 127M Jun 18  2024  UniproUGENE_UserManual.pdf
-rwxr-xr-x  1 jhuang jhuang  14M Jun 18  2024  Untitled1.ipynb
-rwxr-xr-x  1 jhuang jhuang 110K Jun 18  2024  Untitled2.ipynb
-rwxr-xr-x  1 jhuang jhuang 2,9K Jun 18  2024  Untitled3.ipynb
-rwxr-xr-x  1 jhuang jhuang  18K Jun 18  2024  WAC6h_vs_WAP6h_down.txt
-rwxr-xr-x  1 jhuang jhuang  100 Jun 18  2024  damian_nodbs
-rwxr-xr-x  1 jhuang jhuang  45K Jun 18  2024  WAC6h_vs_WAP6h_up.txt
-rwxr-xr-x  1 jhuang jhuang 635K Jun 18  2024 'add. Figures Hamburg_UKE.pptx'
-rwxr-xr-x  1 jhuang jhuang 3,7M Jun 18  2024  all_gene_counts_with_annotation.xlsx
-rwxr-xr-x  1 jhuang jhuang  22K Jun 18  2024  app_flask.py
-rwxr-xr-x  1 jhuang jhuang 1,8K Jun 18  2024  bengal-bay-0.1.json
-rwxr-xr-x  1 jhuang jhuang  16K Jun 18  2024  bengal3_ac3.yml
-rwxr-xr-x  1 jhuang jhuang 246K Jun 18  2024  call_shell_from_Ruby.png
-rwxr-xr-x  1 jhuang jhuang 8,1K Jun 18  2024  bengal3_ac3_.yml
-rwxr-xr-x  1 jhuang jhuang   12 Jun 18  2024  empty.fasta
-rwxr-xr-x  1 jhuang jhuang  32K Jun 18  2024  coefficients_csaw_vs_diffreps.xlsx
-rwxr-xr-x  1 jhuang jhuang 4,3K Jun 18  2024  exchange.txt
-rwxr-xr-x  1 jhuang jhuang  30M Jun 18  2024  exdata-data-NEI_data.zip
-rwxr-xr-x  1 jhuang jhuang 6,6K Jun 18  2024  genes_wac6_wap6.xls
-rwxr-xr-x  1 jhuang jhuang 115M Jun 18  2024  go1.13.linux-amd64.tar.gz.1
-rwxr-xr-x  1 jhuang jhuang  29K Jun 18  2024  hev_p2-p5.fa
-rwxr-xr-x  1 jhuang jhuang 3,8K Jun 18  2024  map_corrected_backup.txt
-rwxr-xr-x  1 jhuang jhuang  325 Jun 18  2024  install_nginx_on_hamm
-rwxr-xr-x  1 jhuang jhuang  20M Jun 18  2024  hg19.rmsk.bed
-rwxr-xr-x  1 jhuang jhuang 107K Jun 18  2024  metadata-9563675-processed-ok.tsv
-rwxr-xr-x  1 jhuang jhuang 288K Jun 18  2024  mkg_sprechstundenflyer_ver1b_dezember_2019.pdf
-rwxr-xr-x  1 jhuang jhuang  588 Jun 18  2024  multiqc_config.yaml
-rwxr-xr-x  1 jhuang jhuang  38K Jun 18  2024  p11326_OMIKRON3398_corsurv.gb
-rwxr-xr-x  1 jhuang jhuang  30K Jun 18  2024  p11326_OMIKRON3398_corsurv.gb_converted.fna
-rwxr-xr-x  1 jhuang jhuang 3,9K Jun 18  2024  parseGenbank_reformat.py
-rwxr-xr-x  1 jhuang jhuang 222K Jun 18  2024  pangenome-snakemake-master.zip
-rwxr-xr-x  1 jhuang jhuang 283K Jun 18  2024 'phylo tree draft.pdf'
-rwxr-xr-x  1 jhuang jhuang  125 Jun 18  2024  qiime_params.txt
-rwxr-xr-x  1 jhuang jhuang 2,3M Jun 18  2024  pool_b1_CGATGT_300.zip
-rwxr-xr-x  1 jhuang jhuang 5,5K Jun 18  2024  qiime_params_backup.txt
-rwxr-xr-x  1 jhuang jhuang 4,5K Jun 18  2024  qiime_params_s16_s18.txt
-rwxr-xr-x  1 jhuang jhuang   68 Jun 18  2024  snakePipes
-rwxr-xr-x  1 jhuang jhuang  25K Jun 18  2024  results_description.html
-rwxr-xr-x  1 jhuang jhuang 139M Jun 18  2024  rnaalihishapes.tar.gz
-rwxr-xr-x  1 jhuang jhuang 3,4M Jun 18  2024  rnaseq_length_bias.pdf
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  3932-Leber
drwxr-xr-x  6 jhuang jhuang 4,0K Jun 18  2024  BioPython
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Biopython
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  DEEP-DV
drwxr-xr-x 13 jhuang jhuang 4,0K Jun 18  2024  DOKTORARBEIT
drwxr-xr-x 17 jhuang jhuang 4,0K Jun 18  2024  Data_16S_Arck_vaginal_stool
drwxr-xr-x 22 jhuang jhuang 4,0K Jun 18  2024  Data_16S_BS052
drwxr-xr-x 13 jhuang jhuang 4,0K Jun 18  2024  Data_16S_Birgit
drwxr-xr-x  3 jhuang jhuang 4,0K Jun 18  2024  Data_16S_Christner
drwxr-xr-x  9 jhuang jhuang 4,0K Jun 18  2024  Data_16S_Leonie
drwxr-xr-x 11 jhuang jhuang 4,0K Jun 18  2024  Data_16S_PatientA-G_CSF
drwxr-xr-x 14 jhuang jhuang 4,0K Jun 18  2024  Data_16S_Schaltenberg
drwxr-xr-x  7 jhuang jhuang 4,0K Jun 18  2024  Data_16S_benchmark
drwxr-xr-x  7 jhuang jhuang 4,0K Jun 18  2024  Data_16S_benchmark2
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_16S_gcdh_BKV
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Alex1_Amplicon
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Alex1_SNP
drwxr-xr-x  5 jhuang jhuang 4,0K Jun 18  2024  Data_Analysis_for_Life_Science
drwxr-xr-x 19 jhuang jhuang 4,0K Jun 18  2024  Data_Anna13_vanA-Element
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Anna14_PACBIO_methylation
drwxr-xr-x  8 jhuang jhuang 4,0K Jun 18  2024  Data_Anna_C.acnes2_old_DEL
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Anna_MT880872_update
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Anna_gap_filling_agrC
drwxr-xr-x  4 jhuang jhuang 4,0K Jun 18  2024  Data_Baechlein_Hepacivirus_2018
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Bornavirus
drwxr-xr-x  3 jhuang jhuang 4,0K Jun 18  2024  Data_CSF
drwxr-xr-x  9 jhuang jhuang 4,0K Jun 18  2024  Data_Christine_cz19-178-rothirsch-bovines-hepacivirus
drwxr-xr-x  4 jhuang jhuang 4,0K Jun 18  2024  Data_Daniela_adenovirus_WGS
drwxr-xr-x  3 jhuang jhuang 4,0K Jun 18  2024  Data_Emilia_MeDIP_DEL
drwxr-xr-x 14 jhuang jhuang 4,0K Jun 18  2024  Data_Francesco2021_16S
drwxr-xr-x  9 jhuang jhuang 4,0K Jun 18  2024  Data_Francesco2021_16S_re
drwxr-xr-x  3 jhuang jhuang 4,0K Jun 18  2024  Data_Gunnar_MS
drwxr-xr-x 10 jhuang jhuang 4,0K Jun 18  2024  Data_Hannes_RNASeq
drwxr-xr-x 29 jhuang jhuang 4,0K Jun 18  2024  Data_Holger_Efaecium_variants_PUBLISHED
drwxr-xr-x  5 jhuang jhuang 4,0K Jun 18  2024  Data_Holger_VRE_DEL
drwxr-xr-x  3 jhuang jhuang 4,0K Jun 18  2024  Data_Icebear_Damian
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Indra3_H3K4_p2_DEL
drwxr-xr-x  4 jhuang jhuang 4,0K Jun 18  2024  Data_Indra6_RNASeq_ChipSeq_Integration_DEL
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Indra_Figures
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_KatjaGiersch_new_HDV
drwxr-xr-x  3 jhuang jhuang 4,0K Jun 18  2024  Data_MHH_Encephalitits_DAMIAN
drwxr-xr-x  6 jhuang jhuang 4,0K Jun 18  2024  Data_Manja_RPAChIPSeq_public
drwxr-xr-x 72 jhuang jhuang  12K Jun 18  2024  Data_Manuel_WGS_Yersinia
drwxr-xr-x 32 jhuang jhuang 4,0K Jun 18  2024  Data_Manuel_WGS_Yersinia2_DEL
drwxr-xr-x  4 jhuang jhuang 4,0K Jun 18  2024  Data_Manuel_WGS_Yersinia_DEL
drwxr-xr-x 13 jhuang jhuang 4,0K Jun 18  2024  Data_Marcus_tracrRNA_structures
drwxr-xr-x  5 jhuang jhuang 4,0K Jun 18  2024  Data_Mausmaki_Damian
drwxr-xr-x  4 jhuang jhuang 4,0K Jun 18  2024  Data_Nicole1_Tropheryma_whipplei
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Nicole5
drwxr-xr-x  6 jhuang jhuang 4,0K Jun 18  2024  Data_Nicole5_77-92
drwxr-xr-x  3 jhuang jhuang 4,0K Jun 18  2024  Data_PaulBecher_Rotavirus
drwxr-xr-x 21 jhuang jhuang 4,0K Jun 18  2024  Data_Pietschmann_HCV_Amplicon_bigFile
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Piscine_Orthoreovirus_3_in_Brown_Trout
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Proteomics
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_RNABioinformatics
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_RNAKinetics
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_R_courses
drwxr-xr-x  5 jhuang jhuang 4,0K Jun 18  2024  Data_SARS-CoV-2
drwxr-xr-x  9 jhuang jhuang 4,0K Jun 18  2024  Data_SARS-CoV-2_Genome_Announcement_PUBLISHED
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Seite
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Song_aggregate_sum
drwxr-xr-x  3 jhuang jhuang 4,0K Jun 18  2024  Data_Susanne_Amplicon_RdRp_orf1_2_re
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Tabea_RNASeq
drwxr-xr-x  5 jhuang jhuang 4,0K Jun 18  2024  Data_Thaiss1_Microarray_new
drwxr-xr-x 10 jhuang jhuang 4,0K Jun 18  2024  Data_Tintelnot_16S
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Wuenee_Plots
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_Yang_Poster
drwxr-xr-x  4 jhuang jhuang 4,0K Jun 18  2024  Data_jupnote
drwxr-xr-x 21 jhuang jhuang 4,0K Jun 18  2024  Data_parainfluenza
drwxr-xr-x  9 jhuang jhuang 4,0K Jun 18  2024  Data_snakemake_recipe
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Data_temp
drwxr-xr-x  3 jhuang jhuang 4,0K Jun 18  2024  Data_viGEN
drwxr-xr-x 19 jhuang jhuang 4,0K Jun 18  2024  Genomic_Data_Science
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Learn_UGENE
drwxr-xr-x  5 jhuang jhuang 4,0K Jun 18  2024  MMcPaper
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  Manuscript_Epigenetics_Macrophage_Yersinia
drwxr-xr-x  6 jhuang jhuang 4,0K Jun 18  2024  Manuscript_RNAHiSwitch
drwxr-xr-x  4 jhuang jhuang 4,0K Jun 18  2024  MeDIP_Emilia_copy_DEL
drwxr-xr-x  5 jhuang jhuang 4,0K Jun 18  2024  Method_biopython
drwxr-xr-x  4 jhuang jhuang 4,0K Jun 18  2024  NGS
drwxr-xr-x  5 jhuang jhuang 4,0K Jun 18  2024  Okazaki-Seq_Processing
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  RNA-NGS_Analysis_modul3_NanoStringNorm
drwxr-xr-x  5 jhuang jhuang 4,0K Jun 18  2024  RNAConSLOptV1.2
drwxr-xr-x  9 jhuang jhuang 4,0K Jun 18  2024  RNAHeliCes
drwxr-xr-x 11 jhuang jhuang 4,0K Jun 18  2024  RNA_li_HeliCes
drwxr-xr-x 10 jhuang jhuang 4,0K Jun 18  2024  RNAliHeliCes
drwxr-xr-x 10 jhuang jhuang 4,0K Jun 18  2024  RNAliHeliCes_Relatedshapes_modified
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  R_refcard
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  R_DataCamp
drwxr-xr-x  3 jhuang jhuang 4,0K Jun 18  2024  R_cats_package
drwxr-xr-x  9 jhuang jhuang 4,0K Jun 18  2024  R_tutorials-master
drwxr-xr-x  7 jhuang jhuang 4,0K Jun 18  2024  SnakeChunks
drwxr-xr-x  6 jhuang jhuang 4,0K Jun 18  2024  align_4l_on_FJ705359
drwxr-xr-x  5 jhuang jhuang 4,0K Jun 18  2024  align_4p_on_FJ705359
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  assembly
drwxr-xr-x  6 jhuang jhuang 4,0K Jun 18  2024  bacto
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  bam2fastq_mapping_again
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  chipster
drwxr-xr-x  5 jhuang jhuang 4,0K Jun 18  2024  damian_GUI
drwxr-xr-x  4 jhuang jhuang 4,0K Jun 18  2024  enhancer-snakemake-demo
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  hg19_gene_annotations
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  interlab_comparison_DEL
drwxr-xr-x  5 jhuang jhuang 4,0K Jun 18  2024  my_flask
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  papers
drwxr-xr-x  6 jhuang jhuang 4,0K Jun 18  2024  pangenome-snakemake_zhaoc1
drwxr-xr-x 14 jhuang jhuang 4,0K Jun 18  2024  pyflow-epilogos
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  raw_data_rnaseq_Indra
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  test_raw_data_dnaseq
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024  test_raw_data_rnaseq
drwxr-xr-x  6 jhuang jhuang 4,0K Jun 18  2024  to_Francesco
drwxr-xr-x 36 jhuang jhuang 4,0K Jun 18  2024  ukepipe
drwxr-xr-x 15 jhuang jhuang 4,0K Jun 18  2024  ukepipe_nf
drwxr-xr-x 17 jhuang jhuang 4,0K Jun 18  2024  var_www_DjangoApp_mysite2_2023-05
-rwxr-xr-x  1 jhuang jhuang  59K Jun 18  2024  roentgenpass.pdf
-rwxr-xr-x  1 jhuang jhuang 9,1M Jun 18  2024  salmon_tx2gene_GRCh38.tsv
-rwxr-xr-x  1 jhuang jhuang 4,1K Jun 18  2024  salmon_tx2gene_chrHsv1.tsv
-rwxr-xr-x  1 jhuang jhuang 8,9K Jun 18  2024 'sample IDs_Lamprecht.xlsx'
-rwxr-xr-x  1 jhuang jhuang  30M Jun 18  2024  summarySCC_PM25.rds
-rwxr-xr-x  1 jhuang jhuang    0 Jun 18  2024  untitled.py
-rwxr-xr-x  1 jhuang jhuang  11M Jun 18  2024  tutorial-rnaseq.pdf
-rwxr-xr-x  1 jhuang jhuang 1,3K Jun 18  2024  x.log
-rwxr-xr-x  1 jhuang jhuang 381M Jun 18  2024  webapp.tar.gz
-rw-rw-r--  1 jhuang jhuang 8,4K Okt  9  2024  temp
-rw-rw-r--  1 jhuang jhuang 2,7K Okt  9  2024  temp2
drwxr-xr-x 51 jhuang jhuang  12K Feb 17 12:23  Data_Susanne_Amplicon_haplotype_analyses_RdRp_orf1_2_re
drwxr-xr-x  6 jhuang jhuang 4,0K Feb 17 12:42  Data_Susanne_WGS_unbiased

jhuang@WS-2290C:~/DATA_D$ ls -tlrh
total 56K
lrwxrwxrwx  1 jhuang jhuang   59 Apr 11  2024 Data_Soeren_RNA-seq_2023_PUBLISHING -> /media/jhuang/Elements/Data_Soeren_RNA-seq_2023_PUBLISHING/
lrwxrwxrwx  1 jhuang jhuang   32 Apr 11  2024 Data_Ute -> /media/jhuang/Elements/Data_Ute/
lrwxrwxrwx  1 jhuang jhuang   52 Apr 23  2024 Data_Marc_RNA-seq_Sepidermidis -> /media/jhuang/Titisee/Data_Marc_RNA-seq_Sepidermidis
drwxrwxr-x  2 jhuang jhuang 4,0K Mai  2  2024 Data_Patricia_Transposon
drwxrwxr-x  2 jhuang jhuang 4,0K Mai 29  2024 Books_DA_for_Life
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 18  2024 Data_Sven
-rw-rw-r--  1 jhuang jhuang 2,9K Jul 16  2024 Datasize_calculation_based_on_coverage.txt
drwxr-xr-x  6 jhuang jhuang 4,0K Jul 23  2024 Data_Paul_HD46_1-wt_resequencing
drwxrwxr-x  2 jhuang jhuang 4,0K Jul 26  2024 Data_Sanam_DAMIAN
drwxrwxr-x 26 jhuang jhuang  12K Jul 30  2024 Data_Tam_variant_calling
drwxrwxr-x  2 jhuang jhuang 4,0K Aug 26  2024 Data_Samira_Manuscripts
drwxrwxr-x  2 jhuang jhuang 4,0K Aug 27  2024 Data_Silvia_VoltRon_Debug
drwxrwxr-x 38 jhuang jhuang 4,0K Jun 10  2025 Data_Pietschmann_229ECoronavirus_Mutations_2024
drwxrwxr-x 23 jhuang jhuang 4,0K Jun 25  2025 Data_Pietschmann_229ECoronavirus_Mutations_2025
lrwxrwxrwx  1 jhuang jhuang   63 Nov 24 16:30 Data_Birthe_Svenja_RSV_Probe3_PUBLISHING -> /media/jhuang/Elements/Data_Birthe_Svenja_RSV_Probe3_PUBLISHING

jhuang@WS-2290C:~/DATA_E$ ls -tlrh
total 119M
drwxr-xr-x 10 jhuang jhuang 4,0K Apr 18  2019 j_huang_until_201904
drwxr-xr-x  2 jhuang jhuang 4,0K Apr 29  2019 Data_2019_April
drwxr-xr-x  2 jhuang jhuang 4,0K Mai 10  2019 Data_2019_May
drwxr-xr-x  2 jhuang jhuang 4,0K Jun 17  2019 Data_2019_June
drwxr-xr-x  2 jhuang jhuang 4,0K Jul 12  2019 Data_2019_July
drwxr-xr-x  3 jhuang jhuang 4,0K Aug 29  2019 Data_2019_August
drwxr-xr-x  3 jhuang jhuang 4,0K Sep  5  2019 Data_2019_September
drwxr-xr-x 11 jhuang jhuang 4,0K Apr 18  2023 Data_Song_RNASeq_PUBLISHED
drwxr-xr-x  7 jhuang jhuang 4,0K Okt 10  2023 Data_Laura_MP_RNASeq
drwxr-xr-x 22 jhuang jhuang  12K Nov  3  2023 Data_Nicole6_HEV_Swantje2
drwxr-xr-x 17 jhuang jhuang 4,0K Nov 13  2023 Data_Becher_Damian_Picornavirus_BovHepV
-rwxr-xr-x  1 jhuang jhuang 118M Nov 28  2023 bacteria_refseq.zip
drwxr-xr-x  3 jhuang jhuang 4,0K Nov 30  2023 bacteria_refseq
drwxr-xr-x  8 jhuang jhuang 4,0K Nov 30  2023 Data_Rotavirus
drwxr-xr-x  6 jhuang jhuang 4,0K Dez  6  2023 Data_Xiaobo_10x
drwx------ 17 jhuang jhuang 4,0K Feb  7  2025 Data_Becher_Damian_Picornavirus_BovHepV_INCOMPLETE_DEL

jhuang@WS-2290C:~/DATA_Intenso$ ls -ltrh
total 4,1G
drwxr-xr-x  15 jhuang jhuang 4,0K Mär 30  2015  HOME_FREIBURG_DEL
drwxr-xr-x   2 jhuang jhuang 4,0K Aug 12  2015  150810_M03701_0019_000000000-AFJFK
drwxr-xr-x   5 jhuang jhuang 4,0K Jan 31  2017  Data_Thaiss2_Microarray
drwxr-xr-x   9 jhuang jhuang 4,0K Apr 27  2017  VirtualBox_VMs_DEL
drwxr-xr-x   7 jhuang jhuang 4,0K Apr 27  2017 'VirtualBox VMs_DEL'
drwxr-xr-x   7 jhuang jhuang 4,0K Apr 27  2017 'VirtualBox VMs2_DEL'
drwxr-xr-x  16 jhuang jhuang 4,0K Mai 12  2017  websites
drwxr-xr-x   5 jhuang jhuang 4,0K Jun 29  2017  DATA
drwxr-xr-x 149 jhuang jhuang  36K Jun 30  2017  Data_Laura
drwxr-xr-x 149 jhuang jhuang  36K Jun 30  2017  Data_Laura_2
drwxr-xr-x   3 jhuang jhuang 4,0K Jun 30  2017  Data_Laura_3
drwxr-xr-x   7 jhuang jhuang 4,0K Jul 10  2017  galaxy_tools
drwxr-xr-x  45 jhuang jhuang  32K Jul 17  2017  Downloads2
drwxr-xr-x   3 jhuang jhuang 4,0K Jul 27  2017  Downloads
drwxr-xr-x   3 jhuang jhuang 4,0K Jul 28  2017  mom-baby_com_cn
drwxr-xr-x   3 jhuang jhuang 4,0K Aug  8  2017 'VirtualBox VMs2'
drwxr-xr-x   3 jhuang jhuang 4,0K Aug  9  2017  VirtualBox_VMs
drwxr-xr-x   3 jhuang jhuang 4,0K Aug 11  2017  CLC_Data
drwxr-xr-x   6 jhuang jhuang  12K Aug 14  2017  Work_Dir2
drwxr-xr-x   7 jhuang jhuang 4,0K Aug 15  2017  Work_Dir2_SGE
drwxr-xr-x  19 jhuang jhuang 4,0K Aug 24  2017  Data_SPANDx1_Kpneumoniae_vs_Assembly1
drwxr-xr-x   3 jhuang jhuang 4,0K Aug 24  2017  MauveOutput
drwxr-xr-x   3 jhuang jhuang 4,0K Aug 31  2017  Fastqs
drwxr-xr-x  20 jhuang jhuang 4,0K Sep  7  2017  Data_Anna3_VRE_Ausbruch
drwxr-xr-x   8 jhuang jhuang 4,0K Sep 19  2017  Work_Dir_mock_broad_mockinput
drwxr-xr-x   8 jhuang jhuang 4,0K Sep 19  2017  Work_Dir_dM_broad_mockinput
drwxr-xr-x   4 jhuang jhuang 4,0K Okt  6  2017  Data_Anna8_RNASeq_static_shake_deprecated
drwxr-xr-x  24 jhuang jhuang 4,0K Okt  9  2017  PENDRIVE_cont
drwxr-xr-x   8 jhuang jhuang 4,0K Okt 23  2017  Work_Dir_WAP_broad_mockinput
drwxr-xr-x  10 jhuang jhuang 4,0K Okt 23  2017  Work_Dir_WAC_broad_mockinput
drwxr-xr-x  11 jhuang jhuang 4,0K Okt 23  2017  Work_Dir_dP_broad_mockinput
drwxr-xr-x  52 jhuang jhuang 4,0K Nov  8  2017  Data_Nicole10_16S_interlab
drwxr-xr-x   6 jhuang jhuang 4,0K Dez  6  2017  PAPERS
drwxr-xr-x  14 jhuang jhuang  16K Dez 15  2017  TB
drwxr-xr-x   5 jhuang jhuang 4,0K Dez 19  2017  Data_Anna4_SNP
drwxr-xr-x  11 jhuang jhuang 4,0K Jan 16  2018  Data_Carolin1_16S
drwxr-xr-x   2 jhuang jhuang 4,0K Jan 22  2018  ChipSeq_Raw_Data3_171009_NB501882_0024_AHNGTYBGX3
-rw-r--r--   1 jhuang jhuang 4,0G Jan 23  2018  m_aepfelbacher_DEL.zip
drwxr-xr-x   7 jhuang jhuang 4,0K Jan 24  2018  Data_Anna7_RNASeq_Cytoscape
drwxr-xr-x   3 jhuang jhuang 4,0K Jan 24  2018  Data_Nicole9_Hund_Katze_Mega
drwxr-xr-x  39 jhuang jhuang  20K Jan 28  2018  Data_Anna2_CO6114
drwxr-xr-x   3 jhuang jhuang 4,0K Jan 28  2018  Data_Nicole3_TH17_orig
drwxr-xr-x  27 jhuang jhuang  28K Jan 28  2018  Data_Nicole1_Tropheryma_whipplei
drwxr-xr-x  16 jhuang jhuang 4,0K Jan 30  2018  results_K27
drwxr-xr-x   2 jhuang jhuang 4,0K Feb 19  2018 'VirtualBox VMs'
drwxr-xr-x  28 jhuang jhuang  12K Feb 27  2018  Data_Anna6_RNASeq
drwxr-xr-x  17 jhuang jhuang  12K Mär  1  2018  Data_Anna1_1585_RNAseq
drwxr-xr-x  21 jhuang jhuang 4,0K Mär  7  2018  Data_Thaiss1_Microarray
drwxr-xr-x  25 jhuang jhuang  12K Mär 27  2018  Data_Nicole7_Anelloviruses_Polyomavirus
drwxr-xr-x  13 jhuang jhuang 4,0K Mai 22  2018  Data_Nina1_Nicole5_1-76
drwxr-xr-x  11 jhuang jhuang 4,0K Mai 22  2018  Data_Nina1_merged
drwxr-xr-x  32 jhuang jhuang 4,0K Jun 14  2018  Data_Nicole8_Lamprecht
drwxr-xr-x  40 jhuang jhuang  16K Jul  5  2018  Data_Anna5_SNP
drwxr-xr-x  35 jhuang jhuang 4,0K Okt 12  2018  chipseq
drwxr-xr-x 107 jhuang jhuang  76K Mai 18  2019  Downloads_DEL
drwxr-xr-x   7 jhuang jhuang 4,0K Mär 17  2020  Data_Gagliani2_enriched_16S
drwxr-xr-x  17 jhuang jhuang 4,0K Mär 17  2020  Data_Gagliani1_18S_16S
drwxr-xr-x   2 jhuang jhuang 4,0K Apr  2  2020  m_aepfelbacher
drwxr-xr-x   4 jhuang jhuang 4,0K Feb 17 12:38  Data_Susanne_WGS_3amplicons

jhuang@WS-2290C:/media/jhuang/Titisee$ ls -tlrh
total 3,5G
drwxrwxrwx 1 jhuang jhuang    0 Dez 19  2017  Data_Anna4_SNP
drwxrwxrwx 1 jhuang jhuang 4,0K Jan 24  2018  Data_Anna5_SNP_rsync_error
-rwxrwxrwx 1 jhuang jhuang 9,9K Mär 21  2018  TRASH
drwxrwxrwx 1 jhuang jhuang  20K Mär 28  2018  Data_Nicole6_HEV_4_SNP_calling_PE_DEL
drwxrwxrwx 1 jhuang jhuang 4,0K Mai 22  2018  Data_Nina1_Nicole7
drwxrwxrwx 1 jhuang jhuang 8,0K Mai 24  2018  Data_Nicole6_HEV_4_SNP_calling_SE_DEL
-rwxrwxrwx 1 jhuang jhuang 3,5G Jun 14  2018  180119_M03701_0115_000000000-BFG46.zip
drwxrwxrwx 1 jhuang jhuang 4,0K Jul 10  2018  Data_Nicole10_16S_interlab_PUBLISHED
drwxrwxrwx 1 jhuang jhuang 4,0K Jul 10  2018  Anna11_assemblies
drwxrwxrwx 1 jhuang jhuang 4,0K Jul 11  2018  Anna11_trees
drwxrwxrwx 1 jhuang jhuang 4,0K Jul 24  2018  Data_Nicole6_HEV_new_orig_fastqs
drwxrwxrwx 1 jhuang jhuang 4,0K Nov 23  2018  Data_Anna9_OXA-48_or_OXA-181
drwxrwxrwx 1 jhuang jhuang 4,0K Feb 15  2019  bengal_results_v1_2018
-rwxrwxrwx 1 jhuang jhuang 9,8M Mär 22  2019  DO.pdf
drwxrwxrwx 1 jhuang jhuang 4,0K Mai  6  2019  damian_DEL
drwxrwxrwx 1 jhuang jhuang    0 Mai 20  2019  MAGpy_db
drwxrwxrwx 1 jhuang jhuang    0 Aug  3  2019  UGENE_v1_32_data_cistrome
drwxrwxrwx 1 jhuang jhuang 4,0K Aug  3  2019  UGENE_v1_32_data_ngs_classification
drwxrwxrwx 1 jhuang jhuang  52K Okt 25  2019  Data_Nicole6_HEV_Swantje
drwxrwxrwx 1 jhuang jhuang 8,0K Okt 25  2019  Data_Nico_Gagliani
drwxrwxrwx 1 jhuang jhuang 4,0K Mär 30  2020  GAMOLA2_prototyp
drwxrwxrwx 1 jhuang jhuang 8,0K Mär 31  2020  Thomas_methylation_EPIC_DO
drwxrwxrwx 1 jhuang jhuang 8,0K Jun 15  2020  Data_Nicola_Schaltenberg
drwxrwxrwx 1 jhuang jhuang  36K Jun 25  2020  Data_Nicola_Schaltenberg_PICRUSt
drwxrwxrwx 1 jhuang jhuang  12K Jan 25  2021  HOME_FREIBURG
drwxrwxrwx 1 jhuang jhuang 4,0K Okt 13  2021  Data_Francesco_16S
drwxrwxrwx 1 jhuang jhuang 4,0K Jun 14  2022  3rd_party
drwxrwxrwx 1 jhuang jhuang 4,0K Jul 29  2022  ConsPred_prokaryotic_genome_annotation
drwxrwxrwx 1 jhuang jhuang 4,0K Aug  2  2022 'System Volume Information'
drwxrwxrwx 1 jhuang jhuang    0 Sep 16  2022  damian_v201016
drwxrwxrwx 1 jhuang jhuang  36K Jan 12  2023  Data_Holger_VRE
drwxrwxrwx 1 jhuang jhuang  32K Feb  1  2023  Data_Holger_Pseudomonas_aeruginosa_SNP
drwxrwxrwx 1 jhuang jhuang 4,0K Sep  5  2023  Eigene_Ordner_HR
drwxrwxrwx 1 jhuang jhuang  24K Sep  6  2023  GAMOLA2
drwxrwxrwx 1 jhuang jhuang  24K Sep 27  2023  Data_Anastasia_RNASeq
drwxrwxrwx 1 jhuang jhuang  24K Okt 20  2023  Data_Amir_PUBLISHED
drwxrwxrwx 1 jhuang jhuang  44K Apr 25  2024  Data_Marc_RNA-seq_Sepidermidis
drwxrwxrwx 1 jhuang jhuang 4,0K Sep 23  2024 '$RECYCLE.BIN'
drwxrwxrwx 1 jhuang jhuang 4,0K Sep 23  2024  Data_Xiaobo_10x_3
drwxrwxrwx 1 jhuang jhuang  24K Nov 28  2024  Data_Tam_DNAseq_2023_Comparative_ATCC19606_AYE_ATCC17978
drwxrwxrwx 1 jhuang jhuang  48K Dez 19  2024  Data_Holger_S.epidermidis_short
-rwxrwxrwx 1 jhuang jhuang   31 Feb  4  2025  TEMP
drwxrwxrwx 1 jhuang jhuang  12K Aug 22 11:44  Data_Holger_S.epidermidis_long

jhuang@WS-2290C:/media/jhuang/Elements(Denise_ChIPseq)$ ls -tlrh
total 11M
drwxr-xr-x 1 jhuang jhuang 4,0K Jun  7  2019  Data_Denise_LTtrunc_H3K27me3_2_results_DEL
drwxr-xr-x 1 jhuang jhuang 4,0K Jun  7  2019  Data_Denise_LTtrunc_H3K4me3_2_results_DEL
drwxr-xr-x 1 jhuang jhuang  28K Aug 26  2019  Data_Anna12_HAPDICS_final_not_finished_DEL
drwxr-xr-x 1 jhuang jhuang 4,0K Okt 24  2019  m_aepfelbacher_DEL
drwxr-xr-x 1 jhuang jhuang  20K Jan 14  2020  Data_Damian
drwxr-xr-x 1 jhuang jhuang 4,0K Jan 25  2020  ST772_DEL
drwxr-xr-x 1 jhuang jhuang 160K Jan 25  2020  ALL_trimmed_part_DEL
drwxr-xr-x 1 jhuang jhuang    0 Mär 30  2020  Data_Denise_ChIPSeq_Protocol1
drwxr-xr-x 1 jhuang jhuang  44K Mai 19  2020  Data_Pietschmann_HCV_Amplicon
drwxr-xr-x 1 jhuang jhuang  60K Jun 26  2020  Data_Nicole6_HEV_ownMethod_new
-rwxr-xr-x 1 jhuang jhuang 2,5M Aug  5  2020  HD04-1.fasta
drwxr-xr-x 1 jhuang jhuang 4,0K Mai 31  2021  RNAHiSwitch_
drwxr-xr-x 1 jhuang jhuang 4,0K Mai 31  2021  RNAHiSwitch__
drwxr-xr-x 1 jhuang jhuang 8,0K Jun 17  2021  RNAHiSwitch___
drwxr-xr-x 1 jhuang jhuang 4,0K Jun 25  2021  RNAHiSwitch_paper_
drwxr-xr-x 1 jhuang jhuang    0 Jul  7  2021  RNAHiSwitch_milestone1_DELETED
-rwxr-xr-x 1 jhuang jhuang 7,2M Jul  7  2021  RNAHiSwitch_paper.tar.gz
drwxr-xr-x 1 jhuang jhuang 4,0K Jul 12  2021  RNAHiSwitch_paper_DELETED
drwxr-xr-x 1 jhuang jhuang  12K Jul 12  2021  RNAHiSwitch_milestone1
drwxr-xr-x 1 jhuang jhuang 4,0K Aug 23  2021  RNAHiSwitch_paper
drwxr-xr-x 1 jhuang jhuang 4,0K Sep 24  2021  Ute_RNASeq_results
drwxr-xr-x 1 jhuang jhuang 4,0K Sep 24  2021  Ute_miRNA_results_38
drwxr-xr-x 1 jhuang jhuang  88K Okt 27  2021  RNAHiSwitch
drwxr-xr-x 1 jhuang jhuang  48K Mär 31  2022  Data_HepE_Freiburg_PUBLISHED
drwxr-xr-x 1 jhuang jhuang 4,0K Jun  1  2022  Data_INTENSO_2022-06
drwxr-xr-x 1 jhuang jhuang    0 Sep 14  2022 '$RECYCLE.BIN'
drwxr-xr-x 1 jhuang jhuang 4,0K Sep 14  2022 'System Volume Information'
drwxr-xr-x 1 jhuang jhuang 4,0K Dez  7  2022  Data_Anna_Mixta_hanseatica_PUBLISHED
-rwxr-xr-x 1 jhuang jhuang  33K Dez  9  2022  coi_disclosure.docx
drwxr-xr-x 1 jhuang jhuang  20K Feb  8  2023  Data_Jingang
drwxr-xr-x 1 jhuang jhuang 4,0K Mai 30  2023  Data_Arck_16S_MMc_PUBLISHED
drwxr-xr-x 1 jhuang jhuang 4,0K Jun  5  2023  Data_Laura_ChIPseq_GSE120945
drwxr-xr-x 1 jhuang jhuang  80K Jun  5  2023  Data_Nicole6_HEV_ownMethod
drwxr-xr-x 1 jhuang jhuang 8,0K Jul  5  2023  Data_Susanne_16S_re_UNPUBLISHED *
drwxr-xr-x 1 jhuang jhuang 4,0K Okt 12  2023  Data_Denise_ChIPSeq_Protocol2
drwxr-xr-x 1 jhuang jhuang 4,0K Okt 20  2023  Data_Caroline_RNAseq_wt_timecourse
drwxr-xr-x 1 jhuang jhuang 4,0K Okt 20  2023  Data_Caroline_RNAseq_brain_organoids
drwxr-xr-x 1 jhuang jhuang  20K Okt 20  2023  Data_Amir_PUBLISHED_DEL
drwxr-xr-x 1 jhuang jhuang 4,0K Nov 24  2023  Data_download_virus_fam
drwxr-xr-x 1 jhuang jhuang  12K Feb 22  2024  Data_Gunnar_Yersiniomics_COPYFAILED_DEL
drwxr-xr-x 1 jhuang jhuang  20K Feb 27  2024  Data_Paul_and_Marc_Epidome_batch3
-rwxr-xr-x 1 jhuang jhuang 3,0K Okt 30  2024  ifconfig_hamm.txt
drwxr-xr-x 1 jhuang jhuang 8,0K Apr  8  2025  Data_Soeren_2023_PUBLISHING
drwxr-xr-x 1 jhuang jhuang  28K Nov 24 13:34  Data_Birthe_Svenja_RSV_Probe3_PUBLISHING
drwxr-xr-x 1 jhuang jhuang  20K Jan 13 17:46  Data_Ute
drwxr-xr-x 1 jhuang jhuang  12K Feb 17 12:48  Data_Susanne_16S_UNPUBLISHED *

jhuang@WS-2290C:/media/jhuang/Seagate Expansion Drive(HOffice)$ ls -tlrh
total 19M
-rwxrwxrwx 1 jhuang jhuang 550K Jan  8  2015  SeagateExpansion.ico
-rwxrwxrwx 1 jhuang jhuang   38 Mär 27  2015  Autorun.inf
-rwxrwxrwx 2 jhuang jhuang  18M Mai  4  2017  Start_Here_Win.exe
-rwxrwxrwx 1 jhuang jhuang 1,1M Jul  7  2017  Warranty.pdf
drwxrwxrwx 1 jhuang jhuang    0 Jan  9  2018  Start_Here_Mac.app
drwxrwxrwx 1 jhuang jhuang    0 Jan  9  2018  Seagate
drwxrwxrwx 1 jhuang jhuang    0 Jun  5  2024  HomeOffice_DIR (Data_Anna_HAPDICS_RNASeq, From_Samsung_T5)
drwxrwxrwx 1 jhuang jhuang 4,0K Jun 17  2024  DATA_COPY_FROM_178528 (copy_and_clean.sh, logfile_jhuang.log, jhuang)
drwxrwxrwx 1 jhuang jhuang    0 Sep  9 10:41 'System Volume Information'
drwxrwxrwx 1 jhuang jhuang    0 Sep  9 10:41 '$RECYCLE.BIN'

jhuang@WS-2290C:/media/jhuang/Elements(Anna_C.arnes)$ ls -trlh
total 236K
drwxrwxrwx 1 jhuang jhuang 8,0K Nov 14  2018  Data_Swantje_HEV_using_viral-ngs
drwxrwxrwx 1 jhuang jhuang    0 Dez  4  2018  VIPER_static_DEL
drwxrwxrwx 1 jhuang jhuang 4,0K Apr  4  2019  Data_Nicole6_HEV_Swantje1_blood
drwxrwxrwx 1 jhuang jhuang  24K Apr  5  2019  Data_Nicole6_HEV_benchmark
drwxrwxrwx 1 jhuang jhuang  20K Mär 12  2020  Data_Denise_RNASeq_GSE79958
drwxrwxrwx 1 jhuang jhuang 8,0K Jan 11  2022  Data_16S_Leonie_from_Nico_Gaglianis
drwxrwxrwx 1 jhuang jhuang 8,0K Jul 29  2022  Fastqs_19-21
drwxrwxrwx 1 jhuang jhuang 4,0K Aug  2  2022 'System Volume Information'
drwxrwxrwx 1 jhuang jhuang 8,0K Sep 23  2022  Data_Luise_Epidome_test
drwxrwxrwx 1 jhuang jhuang  48K Sep 27  2023  Data_Anna_C.acnes_PUBLISHED
drwxrwxrwx 1 jhuang jhuang  24K Dez  6  2023  Data_Denise_LT_DNA_Bindung
drwxrwxrwx 1 jhuang jhuang 4,0K Jan  9  2024  Data_Denise_LT_K331A_RNASeq
drwxrwxrwx 1 jhuang jhuang  12K Jan 10  2024  Data_Luise_Epidome_batch1
drwxrwxrwx 1 jhuang jhuang  28K Feb 26  2024  Data_Luise_Pseudomonas_aeruginosa_PUBLISHED
drwxrwxrwx 1 jhuang jhuang  28K Feb 27  2024  Data_Luise_Epidome_batch2
drwxrwxrwx 1 jhuang jhuang 4,0K Sep  5  2024  picrust2_out_2024_2
drwxrwxrwx 1 jhuang jhuang 4,0K Mär 11  2025 '$RECYCLE.BIN'

jhuang@WS-2290C:/media/jhuang/Seagate Expansion Drive(DATA_COPY_FROM_hamburg)$ ls -tlrh
total 19M
-rwxrwxrwx 1 jhuang jhuang   33 Feb 21  2018 Autorun.inf
-rwxrwxrwx 2 jhuang jhuang  18M Jun 21  2019 Start_Here_Win.exe
-rwxrwxrwx 1 jhuang jhuang 1,6M Jul  6  2020 Warranty.pdf
drwxrwxrwx 1 jhuang jhuang    0 Mär 10  2021 Start_Here_Mac.app
drwxrwxrwx 1 jhuang jhuang    0 Mär 10  2021 Seagate
drwxrwxrwx 1 jhuang jhuang  12K Jun 29  2022 DATA_COPY_TRANSFER_INCOMPLETE_DEL
drwxrwxrwx 1 jhuang jhuang 4,0K Dez 16  2024 DATA_COPY_FROM_hamburg

jhuang@WS-2290C:/media/jhuang/Seagate Expansion Drive(Seagate_1)$ ls -trlh
total 104G
drwxr-xr-x 1 jhuang jhuang 4,0K Okt  3  2013  RNA_seq_analysis_tools_2013
drwxr-xr-x 1 jhuang jhuang    0 Feb 28  2018  Data_Laura0
drwxr-xr-x 1 jhuang jhuang 8,0K Sep  6  2018  Data_Petra_Arck
drwxr-xr-x 1 jhuang jhuang 4,0K Sep 14  2018  Data_Martin_mycoplasma
drwxr-xr-x 1 jhuang jhuang 8,0K Dez  5  2018  chromhmm-enhancers
drwxr-xr-x 1 jhuang jhuang 4,0K Jan 15  2019  ChromHMM_Dir
drwxr-xr-x 1 jhuang jhuang 4,0K Jan 18  2019  Data_Denise_sT_H3K4me3
drwxr-xr-x 1 jhuang jhuang 4,0K Jan 18  2019  Data_Denise_sT_H3K27me3
drwxr-xr-x 1 jhuang jhuang    0 Feb 13  2019  Start_Here_Mac.app
drwxr-xr-x 1 jhuang jhuang    0 Feb 13  2019  Seagate
drwxr-xr-x 1 jhuang jhuang 4,0K Feb 19  2019  Data_Nicole16_parapoxvirus
-rwxr-xr-x 1 jhuang jhuang  39G Aug 20  2019  Project_h_rohde_Susanne_WGS_unbiased_DEL.zip
drwxr-xr-x 1 jhuang jhuang 4,0K Nov 11  2019  Data_Denise_ChIPSeq_Protocol1
drwxr-xr-x 1 jhuang jhuang 8,0K Nov 13  2019  Data_ENNGS_pathogen_detection_pipeline_comparison
drwxr-xr-x 1 jhuang jhuang 4,0K Feb 18  2020  j_huang_201904_202002
-rwxr-xr-x 1 jhuang jhuang  112 Mär  2  2020  Data_Laura_ChIPseq_GSE120945
drwxr-xr-x 1 jhuang jhuang 8,0K Mär 26  2020  batch_200314_incomplete
-rwxr-xr-x 1 jhuang jhuang  65G Mär 26  2020  m_aepfelbacher.zip
drwxr-xr-x 1 jhuang jhuang    0 Mär 26  2020  m_error_DEL
drwxr-xr-x 1 jhuang jhuang 4,0K Mär 28  2020  batch_200325
drwxr-xr-x 1 jhuang jhuang 4,0K Mär 28  2020  batch_200319
drwxr-xr-x 1 jhuang jhuang 4,0K Mär 30  2020  GAMOLA2_prototyp
drwxr-xr-x 1 jhuang jhuang 4,0K Jun 22  2020  Data_Nicola_Gagliani
drwxr-xr-x 1 jhuang jhuang 4,0K Sep  3  2020  2017-18_raw_data
drwxr-xr-x 1 jhuang jhuang 1,2M Sep 11  2020  Data_Arck_MeDIP
drwxr-xr-x 1 jhuang jhuang 4,0K Okt 16  2020  trimmed
drwxr-xr-x 1 jhuang jhuang 4,0K Dez 23  2020  Data_Nicole_16S_Christmas_2020_2
drwxr-xr-x 1 jhuang jhuang 4,0K Jan 14  2021  j_huang_202007_202012
drwxr-xr-x 1 jhuang jhuang 4,0K Jan 15  2021  Data_Nicole_16S_Christmas_2020
drwxr-xr-x 1 jhuang jhuang 184K Jan 18  2021  Downloads_2021-01-18_DEL
drwxr-xr-x 1 jhuang jhuang 4,0K Jan 28  2021  Data_Laura_plasmid
drwxr-xr-x 1 jhuang jhuang 4,0K Mär 18  2021  Data_Laura_16S_2_re
drwxr-xr-x 1 jhuang jhuang 8,0K Mär 22  2021  Data_Laura_16S_2
drwxr-xr-x 1 jhuang jhuang 4,0K Mär 22  2021  Data_Laura_16S_2_re_
drwxr-xr-x 1 jhuang jhuang 8,0K Mär 23  2021  Data_Laura_16S_merged
drwxr-xr-x 1 jhuang jhuang  32K Nov  7  2022  Downloads_DEL
drwxr-xr-x 1 jhuang jhuang  12K Nov  7  2022  Data_Laura_16S
drwxr-xr-x 1 jhuang jhuang  76K Nov  9  2023  Data_Anna12_HAPDICS_final
drwxr-xr-x 1 jhuang jhuang    0 Dez  4  2023 '$RECYCLE.BIN'
drwxr-xr-x 1 jhuang jhuang 4,0K Dez  4  2023 'System Volume Information'

jhuang@WS-2290C:/media/jhuang/Seagate Expansion Drive(Seagate_2)$ ls -trlh
total 70G
drwxr-xr-x 1 jhuang jhuang 4,0K Jan  5  2017  Data_Nicole4_TH17
-rwxr-xr-x 1 jhuang jhuang  18M Feb  9  2018  Start_Here_Win.exe
-rwxr-xr-x 1 jhuang jhuang   33 Feb 21  2018  Autorun.inf
-rwxr-xr-x 1 jhuang jhuang 1,2M Jul 26  2018  Warranty.pdf
drwxr-xr-x 1 jhuang jhuang    0 Feb 13  2019  Start_Here_Mac.app
drwxr-xr-x 1 jhuang jhuang    0 Feb 13  2019  Seagate
drwxr-xr-x 1 jhuang jhuang 4,0K Dez 20  2019  Data_Denise_RNASeq_trimmed_DEL
drwxr-xr-x 1 jhuang jhuang 4,0K Jan 25  2020  HD12
drwxr-xr-x 1 jhuang jhuang 4,0K Jan 25  2020  Qi_panGenome
drwxr-xr-x 1 jhuang jhuang  44K Jan 25  2020  ALL
drwxr-xr-x 1 jhuang jhuang    0 Feb 14  2020  fastq_HPI_bw_2019_08_and_2020_02
-rwxr-xr-x 1 jhuang jhuang  19K Mär 12  2020  f1_R1_link.sh
-rwxr-xr-x 1 jhuang jhuang  19K Mär 12  2020  f1_R2_link.sh
drwxr-xr-x 1 jhuang jhuang  28K Mär 19  2020  rtpd_files
-rwxr-xr-x 1 jhuang jhuang  65G Apr  2  2020  m_aepfelbacher.zip
drwxr-xr-x 1 jhuang jhuang 4,0K Apr 20  2020  Data_Nicole_16S_Hamburg_Odense_Cornell_Muenster
drwxr-xr-x 1 jhuang jhuang 8,0K Apr 21  2020  HyAsP_incomplete_genomes
drwxr-xr-x 1 jhuang jhuang 4,0K Apr 25  2020  HyAsP_normal_sampled_input
drwxr-xr-x 1 jhuang jhuang 8,0K Apr 28  2020  HyAsP_complete_genomes
-rwxr-xr-x 1 jhuang jhuang 176M Mai  8  2020  video.zip
-rwxr-xr-x 1 jhuang jhuang 6,9K Jun  2  2020  sam2bedgff.pl
-rwxr-xr-x 1 jhuang jhuang 5,5K Jul  7  2020  HD04.infection.hS_vs_HD04.nose.hS_annotated_degenes.xls
drwxr-xr-x 1 jhuang jhuang  44K Jul  9  2020  ALL83
drwxr-xr-x 1 jhuang jhuang  20K Jul  9  2020  Data_Pietschmann_RSV_Probe_PUBLISHED
drwxr-xr-x 1 jhuang jhuang 8,0K Jul 27  2020  HyAsP_normal
drwxr-xr-x 1 jhuang jhuang 4,0K Jul 28  2020  Data_Manthey_16S
drwxr-xr-x 1 jhuang jhuang 8,0K Jul 29  2020  rtpd_files_DEL
drwxr-xr-x 1 jhuang jhuang  20K Aug 11  2020  HyAsP_bold
drwxr-xr-x 1 jhuang jhuang  44K Aug 17  2020  Data_HEV
drwxr-xr-x 1 jhuang jhuang 4,0K Sep 29  2020  Seq_VRE_hybridassembly
drwxr-xr-x 1 jhuang jhuang  12K Nov 11  2020  Data_Anna12_HAPDICS_raw_data_shovill_prokka
drwxr-xr-x 1 jhuang jhuang  12K Aug 10  2021  Data_Anna_HAPDICS_WGS_ALL
drwxr-xr-x 1 jhuang jhuang  20K Aug 10  2021  Data_HEV_Freiburg_2020
drwxr-xr-x 1 jhuang jhuang  20K Okt 27  2021  Data_Nicole_HDV_Recombination_PUBLISHED
-rwxr-xr-x 1 jhuang jhuang 905K Feb  8  2022  s_hero2x
-rwxr-xr-x 1 jhuang jhuang 5,5G Feb 25  2022  201030_M03701_0207_000000000-J57B4.zip
-rwxr-xr-x 1 jhuang jhuang 4,9K Mär 21  2022  README
-rwxr-xr-x 1 jhuang jhuang 4,9K Mär 21  2022 'README(1)'
-rwxr-xr-x 1 jhuang jhuang  848 Mär 28  2022  dna2.fasta.fai
-rwxr-xr-x 1 jhuang jhuang  17K Mär 28  2022  91.pep
-rwxr-xr-x 1 jhuang jhuang 9,1K Mär 28  2022  91.orf
-rwxr-xr-x 1 jhuang jhuang  222 Mär 28  2022  91.orf.fai
-rwxr-xr-x 1 jhuang jhuang 1,1M Mär 31  2022  dgaston-dec-06-2012-121211124858-phpapp01.pdf
-rwxr-xr-x 1 jhuang jhuang 5,2K Apr  4  2022  tileshop.fcgi
-rwxr-xr-x 1 jhuang jhuang 765K Apr  4  2022  ppat.1009304.s016.tif
-rwxr-xr-x 1 jhuang jhuang 4,1K Mai  2  2022  sequence.txt
-rwxr-xr-x 1 jhuang jhuang 4,0K Mai  2  2022 'sequence(1).txt'
-rwxr-xr-x 1 jhuang jhuang 3,7K Mai 23  2022  GSE128169_series_matrix.txt.gz
-rwxr-xr-x 1 jhuang jhuang 4,0K Mai 23  2022  GSE128169_family.soft.gz
drwxr-xr-x 1 jhuang jhuang  40K Mär 20  2023  Data_Anna_HAPDICS_RNASeq
drwxr-xr-x 1 jhuang jhuang 1,3M Apr  4  2023  Data_Christopher_MeDIP_MMc_PUBLISHED
drwxr-xr-x 1 jhuang jhuang 8,0K Jun 28  2023  Data_Gunnar_Yersiniomics_IMCOMPLETE_DEL
drwxr-xr-x 1 jhuang jhuang  28K Feb 12  2024  Data_Denise_RNASeq
drwxr-xr-x 1 jhuang jhuang 4,0K Apr  5  2024 'System Volume Information'
drwxr-xr-x 1 jhuang jhuang    0 Apr  5  2024 '$RECYCLE.BIN'

jhuang@WS-2290C:/media/jhuang/Elements(An14_RNAs)$ ls -tlrh
total 284K
drwxr-xr-x 1 jhuang jhuang 8,0K Aug  7  2017  Data_Anna10_RP62A
drwxr-xr-x 1 jhuang jhuang 4,0K Jun 15  2018  Data_Nicole12_16S_Kluwe_Bunders
drwxr-xr-x 1 jhuang jhuang 4,0K Nov 30  2018  chromhmm-enhancers
drwxr-xr-x 1 jhuang jhuang    0 Apr  1  2019  Data_Denise_sT_Methylation
drwxr-xr-x 1 jhuang jhuang    0 Apr  1  2019  Data_Denise_LTtrunc_Methylation
drwxr-xr-x 1 jhuang jhuang  12K Apr 29  2019  Data_16S_arckNov
drwxr-xr-x 1 jhuang jhuang 4,0K Mai 29  2019  Data_Tabea_RNASeq
-rwxr-xr-x 1 jhuang jhuang 4,6K Mai 29  2019  nr_gz_README
drwxr-xr-x 1 jhuang jhuang 4,0K Jun  5  2019  j_huang_raw_fq
drwxr-xr-x 1 jhuang jhuang    0 Jun  7  2019 'System Volume Information'
drwxr-xr-x 1 jhuang jhuang    0 Jun  7  2019 '$RECYCLE.BIN'
drwxr-xr-x 1 jhuang jhuang  36K Jun 18  2019  host_refs
drwxr-xr-x 1 jhuang jhuang    0 Jun 18  2019  Vraw
drwxr-xr-x 1 jhuang jhuang  68K Jul 29  2019  Data_Susanne_Amplicon_RdRp_orf1_2 *
drwxr-xr-x 1 jhuang jhuang 4,0K Aug  6  2019  tmp
drwxr-xr-x 1 jhuang jhuang  28K Sep  4  2020  Data_RNA188_Paul_Becher
drwxr-xr-x 1 jhuang jhuang 4,0K Nov  3  2020  Data_ChIPSeq_Laura
drwxr-xr-x 1 jhuang jhuang  12K Mai  7  2021  Data_16S_arckNov_review_PUBLISHED
drwxr-xr-x 1 jhuang jhuang 8,0K Mai  7  2021  Data_16S_arckNov_re
drwxr-xr-x 1 jhuang jhuang  20K Mai 25  2021  Fastqs
drwxr-xr-x 1 jhuang jhuang 4,0K Aug  9  2021  Data_Tabea_RNASeq_submission
drwxr-xr-x 1 jhuang jhuang 4,0K Aug 27  2021  Data_Anna_Cutibacterium_acnes_DEL
drwxr-xr-x 1 jhuang jhuang    0 Sep 16  2021  Data_Silvia_RNASeq_SUBMISSION
drwxr-xr-x 1 jhuang jhuang 4,0K Feb  9  2022  Data_Hannes_ChIPSeq
drwxr-xr-x 1 jhuang jhuang 4,0K Jul  5  2022  Data_Anna14_RNASeq_to_be_DEL
drwxr-xr-x 1 jhuang jhuang  40K Dez 15  2022  Data_Pietschmann_RSV_Probe2_PUBLISHED
drwxr-xr-x 1 jhuang jhuang    0 Dez 16  2022  Data_Holger_Klebsiella_pneumoniae_SNP_PUBLISHING
drwxr-xr-x 1 jhuang jhuang 4,0K Jun 29  2023  Data_Anna14_RNASeq_plus_public

jhuang@WS-2290C:/media/jhuang/Elements(Indra_HAPDICS)$ ls -trlh
total 452K
drwxrwxrwx 1 jhuang jhuang  20K Jul  3  2018  Data_Anna11_Sepdermidis_DEL
drwxrwxrwx 1 jhuang jhuang  20K Jul 12  2018  HD15_without_10
drwxrwxrwx 1 jhuang jhuang  12K Jul 12  2018  HD31
drwxrwxrwx 1 jhuang jhuang  20K Jul 12  2018  HD33
drwxrwxrwx 1 jhuang jhuang  20K Jul 12  2018  HD39
drwxrwxrwx 1 jhuang jhuang  20K Jul 12  2018  HD43
drwxrwxrwx 1 jhuang jhuang  20K Jul 12  2018  HD46
drwxrwxrwx 1 jhuang jhuang  20K Jul 12  2018  HD15_with_10
drwxrwxrwx 1 jhuang jhuang  12K Jul 13  2018  HD26
drwxrwxrwx 1 jhuang jhuang  20K Jul 13  2018  HD59
drwxrwxrwx 1 jhuang jhuang  12K Jul 13  2018  HD25
drwxrwxrwx 1 jhuang jhuang  20K Jul 16  2018  HD21
drwxrwxrwx 1 jhuang jhuang  20K Jul 17  2018  HD17
drwxrwxrwx 1 jhuang jhuang  24K Sep 24  2018  HD04
drwxrwxrwx 1 jhuang jhuang  20K Mär  5  2019  Data_Anna11_Pair1-6_P6
drwxrwxrwx 1 jhuang jhuang 4,0K Aug 15  2019  Data_Anna12_HAPDICS_HyAsP
drwxrwxrwx 1 jhuang jhuang  68K Dez 27  2019  HAPDICS_hyasp_plasmids
drwxrwxrwx 1 jhuang jhuang 8,0K Jan 14  2021  Data_Anna_HAPDICS_review
-rwxrwxrwx 1 jhuang jhuang 9,6K Jan 26  2021  data_overview.txt
drwxrwxrwx 1 jhuang jhuang 4,0K Jan 29  2021  align_assem_res_DEL
drwxrwxrwx 1 jhuang jhuang    0 Jun  8  2021 'System Volume Information'
drwxrwxrwx 1 jhuang jhuang 4,0K Jun  8  2021  EXCHANGE_DEL
drwxrwxrwx 1 jhuang jhuang 8,0K Aug 30  2021  Data_Indra_H3K4me3_public
drwxrwxrwx 1 jhuang jhuang 4,0K Feb 17  2022  Data_Gunnar_MS
drwxrwxrwx 1 jhuang jhuang 4,0K Jun  2  2022 '$RECYCLE.BIN'
drwxrwxrwx 1 jhuang jhuang 4,0K Jun  2  2022  UKE_DELLWorkstation_C_Users_indbe_Desktop
drwxrwxrwx 1 jhuang jhuang 4,0K Jun  2  2022  Linux_DELLWorkstation_C_Users_indbe_VirtualBoxVMs
drwxrwxrwx 1 jhuang jhuang 4,0K Jun 23  2022  Data_Anna_HAPDICS_RNASeq_rawdata
drwxrwxrwx 1 jhuang jhuang 8,0K Jun 23  2022  Data_Indra_H3K27ac_public
drwxrwxrwx 1 jhuang jhuang  28K Feb 22  2023  Data_Holger_Klebsiella_pneumoniae_SNP_PUBLISHING
drwxrwxrwx 1 jhuang jhuang 4,0K Dez  9  2024  DATA_INDRA_RNASEQ
drwxrwxrwx 1 jhuang jhuang 4,0K Dez  9  2024  DATA_INDRA_CHIPSEQ

jhuang@WS-2290C:/media/jhuang/Elements(jhuang_*)$ ls -ltrh
total 5,0M
-rwxr-xr-x  1 jhuang jhuang 657K Jul  9  2021 'Install Western Digital Software for Windows.exe'
-rwxr-xr-x  1 jhuang jhuang 498K Jul  9  2021 'Install Western Digital Software for Mac.dmg'
drwxr-xr-x  2 jhuang jhuang 1,0M Mai 17  2023 'System Volume Information'
drwxr-xr-x  2 jhuang jhuang 1,0M Aug 26  2024 '$RECYCLE.BIN'
drwxr-xr-x 11 jhuang jhuang 1,0M Feb  4  2025  20250203_FS10003086_95_BTR67811-0621

jhuang@WS-2290C:/media/jhuang/Smarty$ ls -tlrh
total 140K
drwx------  2 jhuang jhuang  16K Mär 14  2018 lost+found
drwxrwxrwx 21 jhuang jhuang  68K Jun 10  2022 Blast_db
drwxrwxr-x  2 jhuang jhuang 4,0K Sep  5  2022 temporary_files_DEL
drwxrwxr-x  9 jhuang jhuang  12K Sep  6  2022 ALIGN_ASSEM
drwxr-xr-x 19 jhuang jhuang 4,0K Sep 29  2022 Data_Paul_Staphylococcus_epidermidis
drwxrwxr-x 11 jhuang jhuang 4,0K Jan 26  2023 Data_16S_Degenhardt_Marius_DEL
drwxrwxr-x 16 jhuang jhuang 4,0K Jun 28  2023 Data_Gunnar_Yersiniomics_DEL
drwxrwxr-x  6 jhuang jhuang 4,0K Jul  5  2023 Data_Manja_RNAseq_Organoids_Virus
drwxrwxr-x 19 jhuang jhuang  12K Sep 27  2023 Data_Emilia_MeDIP
drwxr-xr-x 14 jhuang jhuang 4,0K Okt 30  2023 DjangoApp_Backup_2023-10-30
drwxrwxr-x  5 jhuang jhuang 4,0K Apr 19  2024 ref
drwxrwxr-x  4 jhuang jhuang 4,0K Jul 22  2025 Data_Michelle_RNAseq_2025_raw_data_DEL_AFTER_UPLOAD_GEO

Quick Reference by Research Area: Commonly Used Life-Science Databases (a selection from the Global Core Biodata Resources)

Below is a quick-reference list of recommended databases grouped by research area: start from what you are working on → which resources to check first, with a one-line "what it's for" note after each entry. In summary: for genomics, start with ENA/Ensembl/UCSC; for protein function, UniProt/InterPro; for pathways, Reactome; for drugs and small molecules, ChEMBL/ChEBI; for human variation, gnomAD/GWAS Catalog/ClinGen; for microbial nomenclature and 16S, LPSN/SILVA; for model organisms, FlyBase/WormBase/ZFIN/MGD.


Genomics and sequence data

  • European Nucleotide Archive (ENA): comprehensive archive for raw sequencing data, assemblies, and annotation (the European INSDC node).
  • DNA Data Bank of Japan (DDBJ): Japan's sequence archive (one of the INSDC members).
  • Ensembl: vertebrate genome browsing, comparative genomics, variation, and regulatory annotation.
  • UCSC Genome Browser: visual browsing and annotation tracks for the human genome and many other species.
  • GENCODE: high-quality gene annotation sets for human/mouse (commonly used as the standard reference).

Microbiology/bacteriology (strain information, nomenclature, 16S, etc.)

  • BacDive: standardized strain-level information (culture conditions, phenotypes, origin, etc.).
  • LPSN (List of Prokaryotic names with Standing in Nomenclature): the authority on prokaryotic nomenclature (name validity, taxonomic updates).
  • SILVA: 16S/18S and 23S/28S rRNA sequence and alignment datasets (standard for taxonomy/amplicon work).

Protein function annotation, families/domains, interaction networks

  • UniProt: the central entry point for protein sequences and functional annotation (the most widely used).
  • InterPro: integrated analysis of protein families/domains/functional sites (for annotation and function prediction).
  • CATH: structural classification and evolutionary relationships of protein domains.
  • STRING: protein-protein interaction networks (predictions plus integrated evidence); convenient for functional association.
  • IMEx (International Molecular Exchange Consortium): high-quality, manually curated molecular interaction data.
  • Protein Data Bank (PDB): the global archive of protein/nucleic-acid 3D structures (essential for structural biology).

Pathway, metabolism, and reaction databases

  • Reactome: the classic pathway knowledgebase (widely used for enrichment analysis and mechanistic interpretation).
  • Rhea: standardized knowledgebase of biochemical and transport reactions (annotation/metabolic research).
  • BRENDA: comprehensive enzyme function data (substrates, kinetics, reactions, etc.).
  • EcoCyc: finely curated genome and metabolic pathway database for Escherichia coli K-12.

Chemistry, small molecules, drug targets (drug discovery/chemical biology)

  • ChEBI: dictionary/ontology of chemical entities of biological interest (standard names, structures, classification).
  • ChEMBL: drug-like molecules, bioactivities, and target associations (widely used for drug discovery/repositioning).
  • IUPHAR/BPS Guide to PHARMACOLOGY: authoritative pharmacology knowledgebase (ligand-target relationships, drug information).
  • LIPID MAPS: lipidomics resources plus the lipid nomenclature/classification system.

Transcriptomes, expression profiles, protein expression atlases

  • Bgee: cross-species comparison of gene expression patterns ("where is this gene expressed?").
  • GXD: mouse gene expression database (developmental/tissue expression, etc.).
  • Human Protein Atlas: protein expression and localization across human tissues and cells.
  • Europe PMC: life-science literature portal (full text/abstracts, funding information; efficient for literature research).

Human genetic variation, GWAS, disease ontologies, and clinical interpretation

  • gnomAD: aggregated population variant frequencies (essential for filtering out common variants).
  • GWAS Catalog: standardized database of GWAS SNP-trait associations.
  • Clinical Genome Resource (ClinGen): assessment of the clinical relevance of genes/variants (precision medicine).
  • CIViC (Clinical Interpretation of Variants in Cancer): community-curated platform for the clinical significance of cancer variants.
  • Human Disease Ontology Knowledgebase: disease ontology (unified terminology; useful for integrative analyses).
  • ClinPGx: curated pharmacogenomics knowledge (how genetic variants affect drug response).

Model organisms and species-specific databases

  • FlyBase: Drosophila genetic and molecular data.
  • WormBase: genomics and biology of C. elegans and related nematodes.
  • ZFIN (The Zebrafish Information Network): zebrafish model data.
  • MGD (Mouse Genome Database): mouse genome plus phenotype/disease association data.
  • PomBase: the fission yeast resource.
  • Saccharomyces Genome Database: the budding yeast database.
  • Rat Genome Database: rat genome plus phenotype/disease data.
  • Alliance of Genome Resources: integrated entry point to multiple model-organism resources (convenient for cross-species comparison).

Biodiversity, species catalogues, and taxonomy

  • Catalogue of Life: the unified catalogue and classification of all known species worldwide.
  • Global Biodiversity Information Facility (GBIF): open-data platform for global biodiversity observations and specimen records.

Pathogens and vectors (parasites, insect vectors, etc.)

  • VEuPathDB: a collection of large-scale omics databases for eukaryotic pathogens and invertebrate vectors.

Which NCBI Submission Portal Should You Choose? A Decision Tree so You Don't Take Detours (GenBank / SRA / Genome / TSA / BioProject / BioSample / dbGaP / GTR / ClinVar)

Many people are lost the first time they click "Start a new submission" at NCBI: with so many portals, which one is right? Below is a decision tree that starts from your goal; follow it and you will rarely go wrong. If what you want to publish is data files (FASTQ/FASTA/assemblies/annotation), do not choose GTR; GTR is only for publishing the service description of a clinical/research test.


✅ Step 1: Are you submitting raw sequencing data or assembly/annotation results?

A. I have FASTQ files (raw reads: Illumina/Nanopore/PacBio)

➡️ Choose the Sequence Read Archive (SRA)

  • You submit: reads + library information (platform, PE/SE, strategy, etc.)
  • Almost every journal requirement for reproducible raw data means an SRA submission

You will usually also need:

  • BioSample (the "ID card" of each sample)
  • BioProject (ties all the data of a project together)

✅ Typical path: BioProject → BioSample → SRA


B. I have an assembled genome (contigs/scaffolds/complete genome)

➡️ Choose Genome (the main portal for genome submissions)

  • Suitable for: bacterial/fungal/viral/eukaryotic draft or complete genomes
  • Linked to the GenBank/Assembly system (publicly searchable and citable afterwards)

Usually also needed:

  • BioSample (sample origin information)
  • BioProject (project umbrella)
  • (Optional but strongly recommended) SRA (if you are also willing to publish the raw reads)

✅ Typical path: BioProject → BioSample → SRA (optional/recommended) → Genome


C. I only have a single gene/fragment/plasmid sequence (not a full genome project)

➡️ Choose GenBank

  • Suitable for: single genes, sequence fragments, standalone plasmid sequences, specific regions
  • If you are running a systematic genome project, Genome is usually the better route; GenBank is more of a per-record sequence submission.

D. I have transcriptome assembly results (assembled transcripts, not reads)

➡️ Choose TSA (Transcriptome Shotgun Assembly)

  • TSA takes: the assembled transcript sequences
  • The raw RNA-seq reads should still go to SRA

✅ Typical path: BioProject → BioSample → SRA → TSA


✅ Step 2: Are you submitting clinically sensitive human data, variant interpretations, or a testing service?

E. The data involve human-subject privacy and need controlled access (phenotype + genotype / clinical cohorts)

➡️ Choose dbGaP (controlled access)

  • Suitable for: sensitive human data
  • Usually involves ethics/authorization/review processes (not fully open for download)

F. You want to submit clinical interpretations of variants (pathogenicity, evidence, phenotype associations)

➡️ Choose ClinVar

  • Suitable for: clinical laboratories/research groups sharing variant interpretations

G. You want to register a genetic testing service

➡️ Choose GTR (Genetic Testing Registry)

  • This is a registry of testing services; it is not for uploading the sequencing data themselves

✅ Step 3: Are you managing a collection of datasets as one project?

H. You have multiple samples/batches/data types (SRA + Genome + others)

➡️ Create a BioProject first

  • Purpose: the master directory of the project, convenient for citation and retrieval

I. Every sample needs traceable metadata (source, location, date, host, etc.)

➡️ You will almost always need BioSample

  • Purpose: the sample's ID card; SRA/Genome records are usually linked to it

The ultimate quick-selection mnemonic

  • Raw FASTQ reads → SRA
  • Genome assembly (contigs/scaffolds/complete) → Genome
  • Assembled transcripts → TSA
  • Single gene/fragment/plasmid sequence records → GenBank
  • Tying everything into one project → BioProject
  • Per-sample origin information → BioSample
  • Sensitive, controlled-access human data → dbGaP
  • Clinical variant interpretation → ClinVar
  • Genetic test registration → GTR
  • Bulk/automated submission → API
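
The same mnemonic as a lookup table, for scripting convenience (an illustrative sketch only; the keys are informal labels, not NCBI terms):

# Illustrative mapping from "what you have" to the NCBI submission target (mnemonic above)
SUBMISSION_TARGET = {
    "raw_reads_fastq": "SRA",
    "genome_assembly": "Genome",
    "assembled_transcripts": "TSA",
    "single_gene_or_plasmid": "GenBank",
    "project_umbrella": "BioProject",
    "sample_metadata": "BioSample",
    "controlled_access_human": "dbGaP",
    "clinical_variant_interpretation": "ClinVar",
    "genetic_test_registration": "GTR",
}

def where_to_submit(data_type: str) -> str:
    """Return the submission portal for an informal data-type label."""
    return SUBMISSION_TARGET.get(data_type, "unknown: re-check the decision tree")

print(where_to_submit("raw_reads_fastq"))  # -> SRA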

Below is a more detailed explanation of the GTR (Genetic Testing Registry).


What is the GTR?

The GTR is a public directory at NCBI for registering genetic tests/testing services. Entries are submitted voluntarily by the laboratories/institutions offering the tests, so that the public, clinicians, and researchers can look up which tests exist for a given disease/gene/pathogen, which laboratories offer them, what methodology is used, and what their scope and evidence base are. (NCBI)

Key point: the GTR is not for uploading FASTQ files or genome sequences.

  • Raw sequencing data → SRA
  • Genome assemblies/annotation → Genome / GenBank
  • GTR → registers information about the test itself (think of it as a yellow-pages directory of testing services) (NCBI)

Which tests does the GTR cover?

The GTR's scope goes beyond traditional single-gene disorder tests and also includes:

  • Tests for Mendelian disorders and drug response (pharmacogenomics)
  • Tumor/somatic variant tests
  • Multi-gene panels, arrays, biochemical, cytogenetic, and molecular tests (NCBI)
  • Microbial/pathogen-related tests (e.g., pathogen panels, viral load, serological antibody/antigen tests) (NCBI)

What does a GTR test record usually contain?

Think of it as a test's data sheet combined with laboratory information; typical fields include:

  1. Purpose/use of the test: diagnosis, carrier screening, prognosis, drug-response guidance, etc. (NCBI)
  2. Target: gene/region, variant types, or pathogen targets
  3. Methodology: e.g., PCR, Sanger, NGS panel, MLPA, array, qPCR, Nanopore (state platform and strategy clearly) (NCBI)
  4. Indication/Condition: which diseases/phenotypes it applies to; test–target–indication statements can be linked (NCBI)
  5. Performance and evidence: analytical/clinical validity, references, guidelines or standards (the GTR emphasizes presenting intended use and evidence) (NCBI)
  6. Laboratory information: institution name, contacts, accreditation/certification, service scope, etc. (NCBI)
  7. GTR accession: every test gets a unique identifier, convenient for citation in papers/EHRs. (NCBI)

Who should submit to the GTR?

Primarily laboratories or institutions that offer genetic/molecular testing services (clinical laboratory departments, independent medical laboratories, commercial testing companies, research-institution laboratories, etc.). (NCBI)

If you only do research and want to publish data:

  • Data publication usually goes through BioProject/BioSample + SRA + Genome/GenBank
  • You do not necessarily need the GTR (unless you offer a testing service to others) (NCBI)

How do you submit to the GTR? (Process overview)

GTR submission is generally a two-step process:

1) First register a Laboratory record

Register the laboratory as an entity first; the GTR reviews/contacts new registrants, and only an approved laboratory can submit specific tests. (NCBI)

2) Then submit Test records

There are two ways:

  • Interactive web submission: fill in the information page by page in the submission portal (suitable for a small number of tests) (NCBI)
  • Bulk submission (Excel template): suitable for large numbers of clinical tests; full-field or minimal-field templates can be uploaded (bulk upload of research tests is generally not available/supported). (NCBI)

GTR vs ClinVar vs dbGaP: the three most easily confused

  • GTR: registers testing services (who offers it, how it is tested, what is tested, indications/evidence) (NCBI)
  • ClinVar: interpretations of variant clinical significance with evidence (pathogenicity classifications, etc.; see above)
  • dbGaP: controlled-access archive for sensitive human data (genotype/phenotype)

From Salmon to Subset Heatmaps: A Reproducible Pipeline for Phage/Stress/Biofilm Gene Panels (No p-value Cutoff, Data_JuliaFuchs_RNAseq_2025)

[Figure: heatmap_18h_A_phage_merged3]

This post documents a complete, batch-ready pipeline to generate subset heatmaps (phage / stress-response / biofilm-associated) from bacterial RNA-seq data quantified with Salmon, using DE tables without any p-value cutoff.

You will end up with:

  • Three gene sets (A/B/C):

    • A (phage/prophage genes): extracted from MT880872.1.gb, mapped to CP052959 via BLASTN, converted to CP052959 GeneID_plain
    • B (stress genes): keyword-based selection from CP052959 GenBank annotations
    • C (biofilm genes): keyword-based selection from CP052959 GenBank annotations
  • For each *-all_annotated.csv in results/star_salmon/degenes/:

    • Subset GOI lists for A/B/C (no cutoff; include all rows belonging to the geneset)
    • Per-comparison *_matched.tsv tables for sanity checks
  • Merged 3-condition heatmaps (Untreated + Mitomycin + Moxi) per timepoint (4h/8h/18h) and subset (A/B/C), giving 9 final figures
  • An Excel file per heatmap containing GeneID, GeneName, Description, and the plotted expression matrix

Everything is written so you can run a single shell script for genesets + intersections, then one R script for heatmaps.


0) Environments

We use two conda environments:

  • plot-numpy1 for Python tools and BLAST setup
  • r_env for DESeq2 + plotting heatmaps in R
conda activate plot-numpy1
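
Their exact package lists are not pinned in this post; a minimal creation sketch (package names are assumptions based on what the scripts below import):

conda create -n plot-numpy1 -c conda-forge -c bioconda python=3.10 biopython pandas blast
conda create -n r_env -c conda-forge -c bioconda r-base bioconductor-deseq2 bioconductor-tximport r-gplots r-openxlsx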

1) Directory layout

From your project root:

.
├── CP052959.gb
├── MT880872.1.gb
├── results/star_salmon/degenes/
│   ├── Mitomycin_4h_vs_Untreated_4h-all_annotated.csv
│   ├── ...
└── subset_heatmaps/          # all scripts + outputs go here

Create the output directory:

mkdir -p subset_heatmaps

2) Step A/B/C gene set generation + batch intersection (one command)

This section generates:

  • geneset_A_phage_GeneID_plain.id (+ GeneID.id)
  • geneset_B_stress_GeneID_plain.id (+ GeneID.id)
  • geneset_C_biofilm_GeneID_plain.id (+ GeneID.id)
  • plus all per-contrast GOI_* files and *_matched.tsv

2.1 Script: extract CDS FASTA from MT880872.1.gb

Save as subset_heatmaps/extract_cds_fasta.py

#!/usr/bin/env python3
"""Extract all CDS features from a GenBank record as nucleotide FASTA (>locus_tag)."""
from Bio import SeqIO
import sys

gb = sys.argv[1]      # input GenBank file, e.g. MT880872.1.gb
out_fa = sys.argv[2]  # output FASTA path

rec = SeqIO.read(gb, "genbank")
with open(out_fa, "w") as out:
    for f in rec.features:
        if f.type != "CDS":
            continue
        locus = f.qualifiers.get("locus_tag", ["NA"])[0]
        seq = f.extract(rec.seq)  # strand-aware extraction of the CDS
        out.write(f">{locus}\n{str(seq).upper()}\n")

2.2 Script: BLAST hit mapping → CP052959 GeneID_plain set (geneset A)

Save as subset_heatmaps/blast_hits_to_geneset.py

#!/usr/bin/env python3
import sys
import pandas as pd
from Bio import SeqIO

blast6 = sys.argv[1]
cp_gb  = sys.argv[2]
prefix = sys.argv[3]  # e.g. subset_heatmaps/geneset_A_phage

# Load CP052959 CDS intervals
rec = SeqIO.read(cp_gb, "genbank")
cds = []
for f in rec.features:
    if f.type != "CDS":
        continue
    locus = f.qualifiers.get("locus_tag", [None])[0]
    if locus is None:
        continue
    start = int(f.location.start) + 1
    end   = int(f.location.end)
    cds.append((locus, start, end))
cds_df = pd.DataFrame(cds, columns=["GeneID_plain","start","end"])

# Load BLAST tabular (outfmt 6)
cols = ["qseqid","sseqid","pident","length","mismatch","gapopen","qstart","qend",
        "sstart","send","evalue","bitscore"]
b = pd.read_csv(blast6, sep="\t", names=cols)

# Normalize subject coordinates
b["smin"] = b[["sstart","send"]].min(axis=1)
b["smax"] = b[["sstart","send"]].max(axis=1)

# Filter for strong hits (tune if needed)
b = b[(b["pident"] >= 90.0) & (b["length"] >= 100)]

hits = set()
for _, r in b.iterrows():
    ov = cds_df[(r["smin"] <= cds_df["end"]) & (r["smax"] >= cds_df["start"])]
    hits.update(ov["GeneID_plain"].unique().tolist())

hits = sorted(hits)

plain_path = f"{prefix}_GeneID_plain.id"
geneid_path = f"{prefix}_GeneID.id"

pd.Series(hits).to_csv(plain_path, index=False, header=False)
pd.Series(["gene-" + x for x in hits]).to_csv(geneid_path, index=False, header=False)

print(f"Wrote {len(hits)} genes:")
print(" ", plain_path)
print(" ", geneid_path)

2.3 Script: keyword-based genesets B/C from CP052959 annotations

Save as subset_heatmaps/geneset_by_keywords.py

#!/usr/bin/env python3
import sys, re
import pandas as pd
from Bio import SeqIO

cp_gb  = sys.argv[1]
mode   = sys.argv[2]   # "stress" or "biofilm"
prefix = sys.argv[3]   # e.g. subset_heatmaps/geneset_B_stress

rec = SeqIO.read(cp_gb, "genbank")
rows=[]
for f in rec.features:
    if f.type != "CDS":
        continue
    locus = f.qualifiers.get("locus_tag", [None])[0]
    if locus is None:
        continue
    gene = (f.qualifiers.get("gene", [""])[0] or "")
    product = (f.qualifiers.get("product", [""])[0] or "")
    note = "; ".join(f.qualifiers.get("note", [])) if f.qualifiers.get("note") else ""
    text = " ".join([gene, product, note]).strip()
    rows.append((locus, gene, product, note, text))

df = pd.DataFrame(rows, columns=["GeneID_plain","gene","product","note","text"])

if mode == "stress":
    rgx = re.compile(
        r"\b(stress|heat shock|chaperone|dnaK|groEL|groES|clp|thioredoxin|peroxiredoxin|catalase|superoxide|"
        r"recA|lexA|uvr|mutS|mutL|usp|osm|sox|katA|sod)\b",
        re.I
    )
elif mode == "biofilm":
    rgx = re.compile(
        r"\b(biofilm|ica|pga|polysaccharide|PIA|adhesin|MSCRAMM|fibrinogen-binding|fibronectin-binding|"
        r"clumping factor|sortase|autolysin|atl|nuclease|DNase|protease|dispersin|luxS|agr|sarA|dlt)\b",
        re.I
    )
else:
    raise SystemExit("mode must be stress or biofilm")

sel = df[df["text"].apply(lambda x: bool(rgx.search(x)))].copy()
hits = sorted(sel["GeneID_plain"].unique())

plain_path = f"{prefix}_GeneID_plain.id"
geneid_path = f"{prefix}_GeneID.id"
sel_path = f"{prefix}_hits.tsv"

pd.Series(hits).to_csv(plain_path, index=False, header=False)
pd.Series(["gene-" + x for x in hits]).to_csv(geneid_path, index=False, header=False)
sel.drop(columns=["text"]).to_csv(sel_path, sep="\t", index=False)

print(f"{mode}: wrote {len(hits)} genes:")
print(" ", plain_path)
print(" ", geneid_path)
print(" ", sel_path)

2.4 Script: intersect each DE table with A/B/C (no cutoff) and write GOI lists + matched TSV

Save as subset_heatmaps/make_goi_lists_batch.py

#!/usr/bin/env python3
import sys, glob, os
import pandas as pd

de_dir = sys.argv[1]          # results/star_salmon/degenes
out_dir = sys.argv[2]         # subset_heatmaps
genesetA_plain = sys.argv[3]  # subset_heatmaps/geneset_A_phage_GeneID_plain.id
genesetB_plain = sys.argv[4]  # subset_heatmaps/geneset_B_stress_GeneID_plain.id
genesetC_plain = sys.argv[5]  # subset_heatmaps/geneset_C_biofilm_GeneID_plain.id

def load_plain_ids(path):
    with open(path) as f:
        return set(x.strip() for x in f if x.strip())

A = load_plain_ids(genesetA_plain)
B = load_plain_ids(genesetB_plain)
C = load_plain_ids(genesetC_plain)

def pick_id_cols(df):
    geneid = "GeneID" if "GeneID" in df.columns else None
    plain  = "GeneID_plain" if "GeneID_plain" in df.columns else None
    if plain is None and "GeneName" in df.columns:
        plain = "GeneName"
    return geneid, plain

os.makedirs(out_dir, exist_ok=True)

for csv in sorted(glob.glob(os.path.join(de_dir, "*-all_annotated.csv"))):
    base = os.path.basename(csv).replace("-all_annotated.csv", "")
    df = pd.read_csv(csv)
    geneid_col, plain_col = pick_id_cols(df)
    if plain_col is None:
        raise SystemExit(f"Cannot find GeneID_plain/GeneName in {csv}")

    df["__plain__"] = df[plain_col].astype(str).str.replace("^gene-","", regex=True)

    def write_set(tag, S):
        sub = df[df["__plain__"].isin(S)].copy()

        out_plain = os.path.join(out_dir, f"GOI_{base}_{tag}_GeneID_plain.id")
        out_geneid = os.path.join(out_dir, f"GOI_{base}_{tag}_GeneID.id")
        out_tsv = os.path.join(out_dir, f"{base}_{tag}_matched.tsv")

        sub["__plain__"].drop_duplicates().to_csv(out_plain, index=False, header=False)
        pd.Series(["gene-"+x for x in sub["__plain__"].drop_duplicates()]).to_csv(out_geneid, index=False, header=False)
        sub.to_csv(out_tsv, sep="\t", index=False)

        print(f"{base} {tag}: {sub.shape[0]} rows, {sub['__plain__'].nunique()} genes")

    write_set("A_phage", A)
    write_set("B_stress", B)
    write_set("C_biofilm", C)

2.5 Driver: run everything with one command

Save as subset_heatmaps/run_subset_setup.sh

#!/usr/bin/env bash
set -euo pipefail

DE_DIR="./results/star_salmon/degenes"
OUT_DIR="./subset_heatmaps"

CP_GB="CP052959.gb"
PHAGE_GB="MT880872.1.gb"

mkdir -p "$OUT_DIR"

echo "[INFO] Using DE_DIR=$DE_DIR"
ls -lh "$DE_DIR"/*-all_annotated.csv

# ---- A) BLAST-based phage/prophage geneset ----
python - <<'PY'
from Bio import SeqIO
rec=SeqIO.read("CP052959.gb","genbank")
SeqIO.write(rec, "subset_heatmaps/CP052959.fna", "fasta")
PY

python subset_heatmaps/extract_cds_fasta.py "$PHAGE_GB" "$OUT_DIR/MT880872_CDS.fna"

makeblastdb -in "$OUT_DIR/CP052959.fna" -dbtype nucl -out "$OUT_DIR/CP052959_db" >/dev/null

blastn \
  -query "$OUT_DIR/MT880872_CDS.fna" \
  -db "$OUT_DIR/CP052959_db" \
  -out "$OUT_DIR/MT_vs_CP.blast6" \
  -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" \
  -evalue 1e-10

python subset_heatmaps/blast_hits_to_geneset.py \
  "$OUT_DIR/MT_vs_CP.blast6" "$CP_GB" "$OUT_DIR/geneset_A_phage"

# ---- B/C) keyword-based genesets ----
python subset_heatmaps/geneset_by_keywords.py "$CP_GB" stress  "$OUT_DIR/geneset_B_stress"
python subset_heatmaps/geneset_by_keywords.py "$CP_GB" biofilm "$OUT_DIR/geneset_C_biofilm"

# ---- Batch: intersect each DE CSV with the genesets (no cutoff) ----
python subset_heatmaps/make_goi_lists_batch.py \
  "$DE_DIR" "$OUT_DIR" \
  "$OUT_DIR/geneset_A_phage_GeneID_plain.id" \
  "$OUT_DIR/geneset_B_stress_GeneID_plain.id" \
  "$OUT_DIR/geneset_C_biofilm_GeneID_plain.id"

echo "[INFO] Done. GOI lists are in $OUT_DIR"
ls -1 "$OUT_DIR"/GOI_*_GeneID.id | head

Run it:

bash subset_heatmaps/run_subset_setup.sh

At this point you will have all *_matched.tsv files required for plotting, e.g.:

  • Mitomycin_4h_vs_Untreated_4h_A_phage_matched.tsv
  • Moxi_4h_vs_Untreated_4h_A_phage_matched.tsv
  • … (for 8h/18h and B/C)

3) No-cutoff heatmaps (merged Untreated + Mitomycin + Moxi → 9 figures)

Now switch to your R environment and build the rlog (rld) expression matrix from Salmon quantifications.

conda activate r_env

3.1 Build rld from Salmon outputs (R)

library(tximport)
library(DESeq2)

setwd("~/DATA/Data_JuliaFuchs_RNAseq_2025/results/star_salmon")

files <- c(
  "Untreated_4h_r1" = "./Untreated_4h_1a/quant.sf",
  "Untreated_4h_r2" = "./Untreated_4h_1b/quant.sf",
  "Untreated_4h_r3" = "./Untreated_4h_1c/quant.sf",
  "Untreated_8h_r1" = "./Untreated_8h_1d/quant.sf",
  "Untreated_8h_r2" = "./Untreated_8h_1e/quant.sf",
  "Untreated_8h_r3" = "./Untreated_8h_1f/quant.sf",
  "Untreated_18h_r1" = "./Untreated_18h_1g/quant.sf",
  "Untreated_18h_r2" = "./Untreated_18h_1h/quant.sf",
  "Untreated_18h_r3" = "./Untreated_18h_1i/quant.sf",
  "Mitomycin_4h_r1" = "./Mitomycin_4h_2a/quant.sf",
  "Mitomycin_4h_r2" = "./Mitomycin_4h_2b/quant.sf",
  "Mitomycin_4h_r3" = "./Mitomycin_4h_2c/quant.sf",
  "Mitomycin_8h_r1" = "./Mitomycin_8h_2d/quant.sf",
  "Mitomycin_8h_r2" = "./Mitomycin_8h_2e/quant.sf",
  "Mitomycin_8h_r3" = "./Mitomycin_8h_2f/quant.sf",
  "Mitomycin_18h_r1" = "./Mitomycin_18h_2g/quant.sf",
  "Mitomycin_18h_r2" = "./Mitomycin_18h_2h/quant.sf",
  "Mitomycin_18h_r3" = "./Mitomycin_18h_2i/quant.sf",
  "Moxi_4h_r1" = "./Moxi_4h_3a/quant.sf",
  "Moxi_4h_r2" = "./Moxi_4h_3b/quant.sf",
  "Moxi_4h_r3" = "./Moxi_4h_3c/quant.sf",
  "Moxi_8h_r1" = "./Moxi_8h_3d/quant.sf",
  "Moxi_8h_r2" = "./Moxi_8h_3e/quant.sf",
  "Moxi_8h_r3" = "./Moxi_8h_3f/quant.sf",
  "Moxi_18h_r1" = "./Moxi_18h_3g/quant.sf",
  "Moxi_18h_r2" = "./Moxi_18h_3h/quant.sf",
  "Moxi_18h_r3" = "./Moxi_18h_3i/quant.sf"
)

txi <- tximport(files, type = "salmon", txIn = TRUE, txOut = TRUE)

replicate <- factor(rep(c("r1","r2","r3"), 9))
condition <- factor(c(
  rep("Untreated_4h",3), rep("Untreated_8h",3), rep("Untreated_18h",3),
  rep("Mitomycin_4h",3), rep("Mitomycin_8h",3), rep("Mitomycin_18h",3),
  rep("Moxi_4h",3), rep("Moxi_8h",3), rep("Moxi_18h",3)
))

colData <- data.frame(condition=condition, replicate=replicate, row.names=names(files))
dds <- DESeqDataSetFromTximport(txi, colData, design = ~ condition)

rld <- rlogTransformation(dds)

3.2 Plot merged 3-condition subset heatmaps (R)

Save as subset_heatmaps/draw_9_merged_heatmaps.R (sourced in the "Run it" step below):

suppressPackageStartupMessages(library(gplots))

need <- c("openxlsx")
to_install <- setdiff(need, rownames(installed.packages()))
if (length(to_install)) install.packages(to_install, repos = "https://cloud.r-project.org")
suppressPackageStartupMessages(library(openxlsx))

in_dir  <- "subset_heatmaps"
out_dir <- file.path(in_dir, "heatmaps_merged3")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

pick_col <- function(df, candidates) {
  hit <- intersect(candidates, names(df))
  if (length(hit) == 0) return(NA_character_)
  hit[1]
}

strip_gene_prefix <- function(x) sub("^gene[-_]", "", x)

match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

detect_tag <- function(nm, tags) {
  hits <- vapply(tags, function(t)
    grepl(paste0("(^|_)", t, "(_|$)"), nm, perl = TRUE), logical(1))
  if (!any(hits)) NA_character_ else tags[which(hits)[1]]
}

make_pretty_labels <- function(gene_ids_in_matrix, id2name, id2desc) {
  plain <- strip_gene_prefix(gene_ids_in_matrix)
  nm <- unname(id2name[plain]); ds <- unname(id2desc[plain])
  nm[is.na(nm)] <- ""; ds[is.na(ds)] <- ""
  nm2 <- ifelse(nzchar(nm), nm, plain)
  lbl <- ifelse(nzchar(ds), paste0(nm2, " (", ds, ")"), nm2)
  make.unique(lbl, sep = "_")
}

if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' exists. Create/load it before running this script.")
}
expr_all <- as.matrix(expr_all)
mat_ids <- rownames(expr_all)
if (is.null(mat_ids)) stop("Expression matrix has no rownames.")

times <- c("4h", "8h", "18h")
tags  <- c("A_phage", "B_stress", "C_biofilm")
cond_order_template <- c("Untreated_%s", "Mitomycin_%s", "Moxi_%s")

for (tt in times) {
  for (tag in tags) {

    f_mito <- file.path(in_dir, sprintf("Mitomycin_%s_vs_Untreated_%s_%s_matched.tsv", tt, tt, tag))
    f_moxi <- file.path(in_dir, sprintf("Moxi_%s_vs_Untreated_%s_%s_matched.tsv", tt, tt, tag))
    if (!file.exists(f_mito) || !file.exists(f_moxi)) next

    df1 <- read.delim(f_mito, sep = "\t", header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
    df2 <- read.delim(f_moxi, sep = "\t", header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)

    id_col_1 <- pick_col(df1, c("GeneID","GeneID_plain","Gene_Id","gene_id","locus_tag","LocusTag","ID"))
    id_col_2 <- pick_col(df2, c("GeneID","GeneID_plain","Gene_Id","gene_id","locus_tag","LocusTag","ID"))
    if (is.na(id_col_1) || is.na(id_col_2)) next

    name_col_1 <- pick_col(df1, c("GeneName","Preferred_name","gene","Symbol","Name"))
    name_col_2 <- pick_col(df2, c("GeneName","Preferred_name","gene","Symbol","Name"))
    desc_col_1 <- pick_col(df1, c("Description","product","Product","annotation","Annot","note"))
    desc_col_2 <- pick_col(df2, c("Description","product","Product","annotation","Annot","note"))

    g1 <- unique(trimws(df1[[id_col_1]])); g1 <- g1[nzchar(g1)]
    g2 <- unique(trimws(df2[[id_col_2]])); g2 <- g2[nzchar(g2)]
    GOI_raw <- unique(c(g1, g2))

    present <- intersect(mat_ids, GOI_raw)
    if (!length(present)) {
      present <- unique(mat_ids[strip_gene_prefix(mat_ids) %in% strip_gene_prefix(GOI_raw)])
    }
    if (!length(present)) next

    getcol <- function(df, col, n) if (is.na(col)) rep("", n) else as.character(df[[col]])
    plain1 <- strip_gene_prefix(as.character(df1[[id_col_1]]))
    plain2 <- strip_gene_prefix(as.character(df2[[id_col_2]]))
    nm1 <- getcol(df1, name_col_1, nrow(df1)); nm2 <- getcol(df2, name_col_2, nrow(df2))
    ds1 <- getcol(df1, desc_col_1, nrow(df1)); ds2 <- getcol(df2, desc_col_2, nrow(df2))
    nm1[is.na(nm1)] <- ""; nm2[is.na(nm2)] <- ""
    ds1[is.na(ds1)] <- ""; ds2[is.na(ds2)] <- ""

    keys_all <- unique(c(plain1, plain2))
    id2name <- setNames(rep("", length(keys_all)), keys_all)
    id2desc <- setNames(rep("", length(keys_all)), keys_all)

    fill_map <- function(keys, vals, mp) {
      for (i in seq_along(keys)) {
        k <- keys[i]; v <- vals[i]
        if (!nzchar(k)) next
        if (!nzchar(mp[[k]]) && nzchar(v)) mp[[k]] <- v
      }
      mp
    }
    id2name <- fill_map(plain1, nm1, id2name); id2name <- fill_map(plain2, nm2, id2name)
    id2desc <- fill_map(plain1, ds1, id2desc); id2desc <- fill_map(plain2, ds2, id2desc)

    cond_tags <- sprintf(cond_order_template, tt)
    keep_cols <- match_tags(colnames(expr_all), cond_tags)
    if (!any(keep_cols)) next

    sub_idx <- which(keep_cols)
    sub_names <- colnames(expr_all)[sub_idx]
    cond_for_col <- vapply(sub_names, detect_tag, character(1), tags = cond_tags)
    cond_rank <- match(cond_for_col, cond_tags)
    ord <- order(cond_rank, sub_names)
    sub_idx <- sub_idx[ord]

    expr_sub <- expr_all[present, sub_idx, drop = FALSE]

    row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
    datamat <- expr_sub[row_ok, , drop = FALSE]
    if (nrow(datamat) < 2) next

    hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete")
    mycl <- cutree(hr, h = max(hr$height) / 1.1)
    palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon",
                      "lightblue","pink","purple","lightcyan","salmon","lightgreen")
    mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]

    labRow <- make_pretty_labels(rownames(datamat), id2name, id2desc)
    labCol <- gsub("_", " ", colnames(datamat))

    gene_id <- rownames(datamat)
    gene_plain <- strip_gene_prefix(gene_id)
    gene_name <- unname(id2name[gene_plain]); gene_name[is.na(gene_name)] <- ""
    gene_desc <- unname(id2desc[gene_plain]); gene_desc[is.na(gene_desc)] <- ""

    out_tbl <- data.frame(
      GeneID = gene_id,
      GeneID_plain = gene_plain,
      GeneName = ifelse(nzchar(gene_name), gene_name, gene_plain),
      Description = gene_desc,
      datamat,
      check.names = FALSE,
      stringsAsFactors = FALSE
    )

    base <- sprintf("%s_%s_merged3", tt, tag)

    out_xlsx <- file.path(out_dir, paste0("table_", base, ".xlsx"))
    write.xlsx(out_tbl, out_xlsx, overwrite = TRUE)

    out_png <- file.path(out_dir, paste0("heatmap_", base, ".png"))
    cex_row <- if (nrow(datamat) > 600) 0.90 else if (nrow(datamat) > 300) 1.05 else 1.30
    height <- max(1600, min(18000, 34 * nrow(datamat)))

    png(out_png, width = 2200, height = height)
    heatmap.2(
      datamat,
      Rowv = as.dendrogram(hr),
      Colv = FALSE,
      dendrogram = "row",
      col = bluered(75),
      scale = "row",
      trace = "none",
      density.info = "none",
      RowSideColors = mycol,
      margins = c(12, 60),
      labRow = labRow,
      labCol = labCol,
      cexRow = cex_row,
      cexCol = 2.0,
      srtCol = 15,
      key = FALSE
    )
    dev.off()

    message("WROTE: ", out_png)
    message("WROTE: ", out_xlsx)
  }
}

message("Done. Output dir: ", out_dir)

Run it:

setwd("~/DATA/Data_JuliaFuchs_RNAseq_2025")
source("subset_heatmaps/draw_9_merged_heatmaps.R")

3.3 Optional: Plot 2-condition subset heatmaps (R)

Save as subset_heatmaps/draw_18_heatmaps_from_matched.R:

#!/usr/bin/env Rscript

## =============================================================
## Draw 18 subset heatmaps using *_matched.tsv as input
## Output: subset_heatmaps/heatmaps_from_matched/
##
## Requirements:
##   - rld or vsd exists in environment (DESeq2 transform)
##     If running as Rscript, you must load/create rld/vsd BEFORE sourcing this file
##     (see the note at the bottom for the "source()" way)
##
## Matched TSV must contain GeneID or GeneID_plain (or GeneName) columns.
## =============================================================

suppressPackageStartupMessages(library(gplots))

in_dir  <- "subset_heatmaps"
out_dir <- file.path(in_dir, "heatmaps_from_matched")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# -------------------------
# Helper functions
# -------------------------
pick_col <- function(df, candidates) {
  hit <- intersect(candidates, names(df))
  if (length(hit) == 0) return(NA_character_)
  hit[1]
}
strip_gene_prefix <- function(x) sub("^gene[-_]", "", x)

split_contrast_groups <- function(x) {
  parts <- strsplit(x, "_vs_", fixed = TRUE)[[1]]
  if (length(parts) != 2L) stop("Contrast must be in form A_vs_B: ", x)
  parts
}
match_tags <- function(nms, tags) {
  pat <- paste0("(^|_)(?:", paste(tags, collapse = "|"), ")(_|$)")
  grepl(pat, nms, perl = TRUE)
}

# -------------------------
# Get expression matrix
# -------------------------
if (exists("rld")) {
  expr_all <- assay(rld)
} else if (exists("vsd")) {
  expr_all <- assay(vsd)
} else {
  stop("Neither 'rld' nor 'vsd' exists. Create/load it before running this script.")
}
expr_all <- as.matrix(expr_all)
mat_ids <- rownames(expr_all)
if (is.null(mat_ids)) stop("Expression matrix has no rownames.")

# -------------------------
# List your 18 matched inputs
# -------------------------
matched_files <- c(
  "Mitomycin_4h_vs_Untreated_4h_A_phage_matched.tsv",
  "Mitomycin_4h_vs_Untreated_4h_B_stress_matched.tsv",
  "Mitomycin_4h_vs_Untreated_4h_C_biofilm_matched.tsv",
  "Mitomycin_8h_vs_Untreated_8h_A_phage_matched.tsv",
  "Mitomycin_8h_vs_Untreated_8h_B_stress_matched.tsv",
  "Mitomycin_8h_vs_Untreated_8h_C_biofilm_matched.tsv",
  "Mitomycin_18h_vs_Untreated_18h_A_phage_matched.tsv",
  "Mitomycin_18h_vs_Untreated_18h_B_stress_matched.tsv",
  "Mitomycin_18h_vs_Untreated_18h_C_biofilm_matched.tsv",
  "Moxi_4h_vs_Untreated_4h_A_phage_matched.tsv",
  "Moxi_4h_vs_Untreated_4h_B_stress_matched.tsv",
  "Moxi_4h_vs_Untreated_4h_C_biofilm_matched.tsv",
  "Moxi_8h_vs_Untreated_8h_A_phage_matched.tsv",
  "Moxi_8h_vs_Untreated_8h_B_stress_matched.tsv",
  "Moxi_8h_vs_Untreated_8h_C_biofilm_matched.tsv",
  "Moxi_18h_vs_Untreated_18h_A_phage_matched.tsv",
  "Moxi_18h_vs_Untreated_18h_B_stress_matched.tsv",
  "Moxi_18h_vs_Untreated_18h_C_biofilm_matched.tsv"
)

matched_paths <- file.path(in_dir, matched_files)

# -------------------------
# Main loop
# -------------------------
for (path in matched_paths) {

  if (!file.exists(path)) {
    message("SKIP missing: ", path)
    next
  }

  base <- sub("_matched\\.tsv$", "", basename(path))
  # base looks like: Mitomycin_4h_vs_Untreated_4h_A_phage

  # split base into contrast + tag (last 2 underscore fields are the tag)
  parts <- strsplit(base, "_")[[1]]
  if (length(parts) < 6) {
    message("SKIP unexpected name: ", base)
    next
  }

  # infer tag as last 2 parts: e.g. A_phage / B_stress / C_biofilm
  tag <- paste0(parts[length(parts)-1], "_", parts[length(parts)])
  # contrast is the rest
  contrast <- paste(parts[1:(length(parts)-2)], collapse = "_")

  # read matched TSV
  df <- read.delim(path, sep = "\t", header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)

  id_col <- pick_col(df, c("GeneID", "GeneID_plain", "GeneName", "Gene_Id", "gene_id", "locus_tag", "LocusTag", "ID"))
  if (is.na(id_col)) {
    message("SKIP (no ID col): ", path)
    next
  }

  GOI_raw <- unique(trimws(df[[id_col]]))
  GOI_raw <- GOI_raw[nzchar(GOI_raw)]

  # match GOI to matrix ids robustly
  present <- intersect(mat_ids, GOI_raw)
  if (!length(present)) {
    present <- unique(mat_ids[strip_gene_prefix(mat_ids) %in% strip_gene_prefix(GOI_raw)])
  }
  if (!length(present)) {
    message("SKIP (no GOI matched matrix): ", base)
    next
  }

  # subset columns for the two groups
  groups <- split_contrast_groups(contrast)
  keep_cols <- match_tags(colnames(expr_all), groups)
  if (!any(keep_cols)) {
    message("SKIP (no columns matched groups): ", contrast)
    next
  }
  cols_idx <- which(keep_cols)
  sub_colnames <- colnames(expr_all)[cols_idx]

  # put Untreated first (2nd group in "Treated_vs_Untreated")
  ord <- order(!grepl(paste0("(^|_)", groups[2], "(_|$)"), sub_colnames, perl = TRUE))
  cols_idx <- cols_idx[ord]

  expr_sub <- expr_all[present, cols_idx, drop = FALSE]

  # remove constant/NA rows
  row_ok <- apply(expr_sub, 1, function(x) is.finite(sum(x)) && var(x, na.rm = TRUE) > 0)
  datamat <- expr_sub[row_ok, , drop = FALSE]
  if (nrow(datamat) < 2) {
    message("SKIP (too few rows after filtering): ", base)
    next
  }

  # clustering
  hr <- hclust(as.dist(1 - cor(t(datamat), method = "pearson")), method = "complete")
  mycl <- cutree(hr, h = max(hr$height) / 1.1)
  palette_base <- c("yellow","blue","orange","magenta","cyan","red","green","maroon",
                    "lightblue","pink","purple","lightcyan","salmon","lightgreen")
  mycol <- palette_base[(as.vector(mycl) - 1) %% length(palette_base) + 1]

  # labels
  labRow <- rownames(datamat)
  labRow <- sub("^gene-", "", labRow)
  labRow <- sub("^rna-", "", labRow)

  labCol <- colnames(datamat)
  labCol <- gsub("_", " ", labCol)

  # output sizes
  height <- max(900, min(12000, 25 * nrow(datamat)))

  out_png <- file.path(out_dir, paste0("heatmap_", base, ".png"))
  out_mat <- file.path(out_dir, paste0("matrix_", base, ".csv"))
  write.csv(as.data.frame(datamat), out_mat, quote = FALSE)

  png(out_png, width = 1100, height = height)
  heatmap.2(
    datamat,
    Rowv = as.dendrogram(hr),
    Colv = FALSE,
    dendrogram = "row",
    col = bluered(75),
    scale = "row",
    trace = "none",
    density.info = "none",
    RowSideColors = mycol,
    margins = c(10, 15),
    sepwidth = c(0, 0),
    labRow = labRow,
    labCol = labCol,
    cexRow = if (nrow(datamat) > 500) 0.6 else 1.0,
    cexCol = 1.7,
    srtCol = 15,
    lhei = c(0.01, 4),
    lwid = c(0.5, 4),
    key = FALSE
  )
  dev.off()

  message("WROTE: ", out_png)
}

message("All done. Output dir: ", out_dir)

Run it:

setwd("~/DATA/Data_JuliaFuchs_RNAseq_2025")
source("subset_heatmaps/draw_18_heatmaps_from_matched.R")

4) Optional: Update README_Heatmap to support “GOI file OR no-cutoff”

If you still use the older README_Heatmap logic that expects *-up.id and *-down.id, replace the GOI-building block with this (single GOI list or whole CSV with no cutoff):

geneset_file <- NA_character_   # e.g. "subset_heatmaps/GOI_Mitomycin_4h_vs_Untreated_4h_A_phage_GeneID.id"
use_all_genes_no_cutoff <- FALSE

if (!is.na(geneset_file) && file.exists(geneset_file)) {
  GOI <- read_ids_from_file(geneset_file)

} else if (isTRUE(use_all_genes_no_cutoff)) {
  all_path <- file.path("./results/star_salmon/degenes", paste0(contrast, "-all_annotated.csv"))
  ann <- read.csv(all_path, stringsAsFactors = FALSE, check.names = FALSE)

  id_col <- if ("GeneID" %in% names(ann)) "GeneID" else if ("GeneID_plain" %in% names(ann)) "GeneID_plain" else NA_character_
  if (is.na(id_col)) stop("No GeneID / GeneID_plain in: ", all_path)

  GOI <- unique(trimws(gsub('"', "", ann[[id_col]])))
  GOI <- GOI[nzchar(GOI)]

} else {
  stop("Set geneset_file OR set use_all_genes_no_cutoff <- TRUE")
}

present <- intersect(rownames(RNASeq.NoCellLine), GOI)
if (!length(present)) stop("None of the GOI found in expression matrix rownames.")
GOI <- present
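
The block above calls read_ids_from_file(), which the older README_Heatmap presumably defines. If yours does not, a minimal sketch (one ID per line; quotes and whitespace stripped):

read_ids_from_file <- function(path) {
  x <- readLines(path, warn = FALSE)   # one ID per line
  x <- trimws(gsub('"', "", x))        # strip quotes and surrounding whitespace
  unique(x[nzchar(x)])                 # drop empty lines, deduplicate
}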

5) Script inventory (bash + python)

Bash

  • subset_heatmaps/run_subset_setup.sh

Python

  • subset_heatmaps/extract_cds_fasta.py
  • subset_heatmaps/blast_hits_to_geneset.py
  • subset_heatmaps/geneset_by_keywords.py
  • subset_heatmaps/make_goi_lists_batch.py

R

  • subset_heatmaps/draw_9_merged_heatmaps.R
  • subset_heatmaps/draw_18_heatmaps_from_matched.R

Because this post doubles as a lab wiki / GitHub README / methods note, every script referenced is included in full above (and can be copied into subset_heatmaps/ directly).

Bacterial WGS Pipeline (Isolate Genomes, Data_Tam_DNAseq_2026_Acinetobacter_harbinensis): nf-core/bacass → Assembly/QC → Annotation → AMR/Virulence → Core-Genome Phylogeny → ANI

[Figure: AN6_core_tree]

This post is a standalone, reproducible record of the bacterial WGS pipeline I used (example sample: AN6). I’m keeping all command lines (as-run) so you can reuse the workflow for future projects. Wherever you see absolute paths, replace them with your own.


0) Prerequisites (what you need installed)

  • Nextflow
  • Docker (for nf-core/bacass -profile docker)
  • Conda/Mamba
  • CLI tools used later: fastqc, spades.py, shovill, pigz, awk, seqkit, fastANI, plus R (for plotting), and the tools required by the provided scripts.

1) Download KmerFinder database

# Download the kmerfinder database: https://www.genomicepidemiology.org/services/ --> https://cge.food.dtu.dk/services/KmerFinder/ --> https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz
# Download 20190108_kmerfinder_stable_dirs.tar.gz from https://zenodo.org/records/13447056
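
A minimal download/extract sketch using the URLs above (the target directory matches the --kmerfinderdb path used in the next step; adjust to your layout):

mkdir -p /mnt/nvme1n1p1/REFs/kmerfinder
wget -P /mnt/nvme1n1p1/REFs https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz
tar -xzf /mnt/nvme1n1p1/REFs/kmerfinder_db.tar.gz -C /mnt/nvme1n1p1/REFs/kmerfinder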

2) Run nf-core/bacass (Nextflow)

    #--kmerfinderdb /path/to/kmerfinder/bacteria.tar.gz
    #--kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder_db.tar.gz
    #--kmerfinderdb /mnt/nvme1n1p1/REFs/20190108_kmerfinder_stable_dirs.tar.gz
    nextflow run nf-core/bacass -r 2.5.0 -profile docker \
      --input samplesheet.tsv \
      --outdir bacass_out \
      --assembly_type long \
      --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
      --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
      -resume

    # Save bacass_out/Kmerfinder/kmerfinder_summary.csv as bacass_out/Kmerfinder/An6/An6_kmerfinder_results.xlsx (a sketch follows below)
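
A minimal sketch of that CSV → XLSX step (assumes pandas and openpyxl are installed; paths as in the note above):

python3 - <<'PY'
import pandas as pd  # needs openpyxl installed for .xlsx output

df = pd.read_csv("bacass_out/Kmerfinder/kmerfinder_summary.csv")
df.to_excel("bacass_out/Kmerfinder/An6/An6_kmerfinder_results.xlsx", index=False)
PY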

3) Assembly (AN6 example)

3.1 Link raw reads + run FastQC

ln -s ../X101SC25116512-Z01-J002/01.RawData/An6/An6_1.fq.gz An6_R1.fastq.gz
ln -s ../X101SC25116512-Z01-J002/01.RawData/An6/An6_2.fq.gz An6_R2.fastq.gz
mkdir fastqc_out
fastqc -t 4 raw_data/* -o fastqc_out/
mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3

3.2 Trimming decision notes (kept as recorded)

For the AN6 data, running Trimmomatic first offers little benefit in most cases (adapters look fine; the per-tile failures are instrument/tile related and are not "fixed" by trimming).

* **Adapters:** FastQC shows **Adapter Content = PASS** for both R1/R2.
* **Overrepresented sequences:** none detected.
* **Per-tile sequence quality:** **FAIL** (this is usually an instrument/tile effect; trimming adapters won’t “fix” it).

Shovill: avoid pre-trimming (it error-corrects reads itself, and adapter trimming can be enabled with --trim). SPAdes: trimming is optional; try raw reads first, then trimmed if needed.

3.3 If you do need Trimmomatic (command kept)

# Paired-end trimming with Trimmomatic (Illumina-style)
# Adjust TRIMMOMATIC_JAR and ADAPTERS paths to your install.

TRIMMOMATIC_JAR=/path/to/trimmomatic.jar
ADAPTERS=/path/to/Trimmomatic/adapters/TruSeq3-PE.fa

java -jar "$TRIMMOMATIC_JAR" PE -threads 16 -phred33 \
  An6_R1.fastq.gz An6_R2.fastq.gz \
  An6_R1.trim.paired.fastq.gz An6_R1.trim.unpaired.fastq.gz \
  An6_R2.trim.paired.fastq.gz An6_R2.trim.unpaired.fastq.gz \
  ILLUMINACLIP:"$ADAPTERS":2:30:10 \
  LEADING:3 TRAILING:3 \
  SLIDINGWINDOW:4:20 \
  MINLEN:50

What you feed into SPAdes/Shovill afterward:

  • Use the paired outputs:

    • An6_R1.trim.paired.fastq.gz
    • An6_R2.trim.paired.fastq.gz
  • Optional: you can include unpaired reads in SPAdes (a sketch follows below), but many people skip them for isolate assemblies unless coverage is low.

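
If you do want the unpaired reads, a sketch (merging both unpaired files into a single -s library; filenames follow the Trimmomatic outputs above):

cat An6_R1.trim.unpaired.fastq.gz An6_R2.trim.unpaired.fastq.gz > An6_unpaired.fastq.gz

spades.py \
  -1 An6_R1.trim.paired.fastq.gz \
  -2 An6_R2.trim.paired.fastq.gz \
  -s An6_unpaired.fastq.gz \
  --isolate \
  -t 32 -m 250 \
  -o spades_out_trimmed

The as-run commands below used the raw (untrimmed) reads: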

spades.py \
  -1 raw_data/An6_R1.fastq.gz \
  -2 raw_data/An6_R2.fastq.gz \
  --isolate \
  -t 32 -m 250 \
  -o spades_out
spades.py \
  -1 raw_data/An6_R1.fastq.gz \
  -2 raw_data/An6_R2.fastq.gz \
  --careful \
  -t 32 -m 250 \
  -o spades_out_careful

Shovill (CHOSEN; it error-corrects reads by default, and adapter trimming can be enabled with --trim):

shovill \
  --R1 raw_data/An6_R1.fastq.gz \
  --R2 raw_data/An6_R2.fastq.gz \
  --outdir shovill_out \
  --cpus 32 --ram 250 \
  --depth 100

If you want to keep reads completely untouched in Shovill, add --noreadcorr to skip read error correction (adapter trimming is opt-in via --trim).


4) Genome annotation — BV-BRC ComprehensiveGenomeAnalysis

* Use: https://www.bv-brc.org/app/ComprehensiveGenomeAnalysis
* Input: scaffolded results from bacass
* Purpose: comprehensive overview + annotation of the genome assembly.

5) Table 1 — summary of sequence data + genome features (env: gunc_env)

5.1 Environment prep + pipeline run (kept)

# Prepare the environment and run the Table 1 pipeline (summary of sequence data and genome features; env: gunc_env):

# make sure gunc_env has openpyxl (needed by the Excel export in STEP_2)
mamba activate gunc_env
mamba install -n gunc_env -c conda-forge openpyxl -y
mamba deactivate

# STEP_1
ENV_NAME=gunc_env \
SAMPLE=AN6 \
ASM=shovill_out/contigs.fa \
R1=./X101SC25116512-Z01-J002/01.RawData/An6/An6_1.fq.gz \
R2=./X101SC25116512-Z01-J002/01.RawData/An6/An6_2.fq.gz \
./make_table1_pe.sh
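
# With the defaults in make_table1_pe.sh (below), this writes Table1_AN6.tsv and the workdir table1_AN6_work/.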

# STEP_2
python export_table1_stats_to_excel_py36_compat.py \
  --workdir table1_AN6_work \
  --out Comprehensive_AN6.xlsx \
  --max-rows 200000 \
  --sample AN6

5.2 Manual calculations (kept)

#Manual calculation for the items “Total number of reads sequenced” and “Mean read length (bp)”:
#Total number of reads sequenced    9,127,297 × 2
#Coverage depth (sequencing depth)  589.4×

pigz -dc X101SC25116512-Z01-J002/01.RawData/An6/An6_1.fq.gz | awk 'END{print NR/4}'
seqkit stats X101SC25116512-Z01-J002/01.RawData/An6/An6_1.fq.gz
#file                                                format  type    num_seqs        sum_len  min_len  avg_len  max_len
#X101SC25116512-Z01-J002/01.RawData/An6/An6_1.fq.gz  FASTQ   DNA   15,929,405  2,389,410,750      150      150      150
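
#Theoretical depth from raw yield: 2 × num_seqs × 150 bp over the 3,012,410 bp assembly; the 1454.3× in the table below is presumably the mapped mean depth (mosdepth, see make_table1_pe.sh), which sits a bit lower:
awk 'BEGIN{printf "%.1fx\n", 2*15929405*150/3012410}'   # ≈1586.4x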

5.3 Example metrics table snapshot (kept)

Metric                                  Value
Genome size (bp)                        3,012,410
Contig count (>= 500 bp)                41
Total number of reads sequenced         15,929,405 × 2
Coverage depth (sequencing depth)       1454.3×
Coarse consistency (%)                  99.67
Fine consistency (%)                    94.50
Completeness (%)                        99.73
Contamination (%)                       0.21
Contigs N50 (bp)                        169,757
Contigs L50                             4
Guanine-cytosine content (%)            41.14
Number of coding sequences (CDSs)       2,938
Number of tRNAs                         69
Number of rRNAs                         3

6) AMR / virulence screening (ABRicate workflows)

    cp shovill_out/contigs.fa AN6.fasta

    ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=AN6.fasta SAMPLE=AN6 THREADS=32 ./run_resistome_virulome_dedup.sh  #Default MINID=90 MINCOV=60
    ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=AN6.fasta SAMPLE=AN6 MINID=80 MINCOV=60 ./run_resistome_virulome_dedup.sh    # 0 0 0 0
    ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=AN6.fasta SAMPLE=AN6 MINID=70 MINCOV=50 ./run_resistome_virulome_dedup.sh    # 5 5 0 4
    #Sanity checks on ABRicate outputs
    grep -vc '^#' resistome_virulence_AN6/raw/AN6.megares.tab
    grep -vc '^#' resistome_virulence_AN6/raw/AN6.card.tab
    grep -vc '^#' resistome_virulence_AN6/raw/AN6.resfinder.tab
    grep -vc '^#' resistome_virulence_AN6/raw/AN6.vfdb.tab

    #DEBUG_TOMORROW: why didn’t 'MINID=70 MINCOV=50' reproduce the 5/5/0/4 hit counts noted above?
    #Dedup tables / “one per gene” mode
    rm Resistome_Virulence_An6.xlsx
    chmod +x run_abricate_resistome_virulome_one_per_gene.sh
    ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 \
    ASM=AN6.fasta \
    SAMPLE=AN6 \
    OUTDIR=resistome_virulence_AN6 \
    MINID=70 MINCOV=50 \
    THREADS=32 \
    ./run_abricate_resistome_virulome_one_per_gene.sh

    cd resistome_virulence_AN6
    python3 - <<'PY'
from pathlib import Path
import pandas as pd

files = ["Table_AMR_genes_dedup.tsv", "Table_AMR_genes_one_per_gene.tsv",
         "Table_Virulence_VFDB_dedup.tsv", "Table_DB_hit_counts.tsv"]
out = "AN6_resistome_virulence.xlsx"
with pd.ExcelWriter(out, engine="openpyxl") as w:
    for f in files:
        pd.read_csv(f, sep="\t").to_excel(w, sheet_name=Path(f).stem[:31], index=False)
print(out)
PY

7) Core-genome phylogeny (NCBI + Roary + RAxML-NG + R plotting)

  #Generate targets.tsv from ./bvbrc_out/Acinetobacter_harbinensis_AN6/FullGenomeReport.html.
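
    #targets.tsv is tab-separated with a header (format per parse_targets_tsv in the script below); strain_tokens is ';'-separated. A plausible row (tokens illustrative):
    #label	organism	strain_tokens
    #Acinetobacter_harbinensis_HITLi7	Acinetobacter harbinensis	HITLi7;HITLi 7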

    export NCBI_EMAIL="xxx@yyy.de"
    ./resolve_best_assemblies_entrez.py targets.tsv resolved_accessions.tsv

    #[OK] Acinetobacter_harbinensis_HITLi7 -> GCF_000816495.1 (Scaffold)
    #[OK] Acinetobacter_sp._ANC -> GCF_965200015.1 (Complete Genome)
    #[OK] Acinetobacter_sp._TTH0-4 -> GCF_965200015.1 (Complete Genome)
    #[OK] Acinetobacter_tandoii_DSM_14970 -> GCF_000621065.1 (Scaffold)
    #[OK] Acinetobacter_towneri_DSM_14962 -> GCF_000368785.1 (Scaffold)
    #[OK] Acinetobacter_radioresistens_SH164 -> GCF_000162115.1 (Scaffold)
    #[OK] Acinetobacter_radioresistens_SK82 -> GCF_000175675.1 (Contig)
    #[OK] Acinetobacter_radioresistens_DSM_6976 -> GCF_000368905.1 (Scaffold)
    #[OK] Acinetobacter_indicus_ANC -> GCF_000413875.1 (Scaffold)
    #[OK] Acinetobacter_indicus_CIP_110367 -> GCF_000488255.1 (Scaffold)

    #NOTE: the bengal3_ac3 env lacks the R packages needed for plotting, so use r_env for the plot step → RUN TWICE: first the full pipeline with bengal3_ac3, then build_wgs_tree_fig3B.sh plot-only.
    #ADAPT the parameters: EXTRA_ASSEMBLIES (may stay empty) and REF_FASTA (here AN6.fasta).
    conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
    export NCBI_EMAIL="xxx@yyy.de"
    ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ./build_wgs_tree_fig3B.sh

    # (Optional) to delete leaves from the tree, remove them from the inputs so Roary cannot include them:
    for id in "GCF_002291425.1" "GCF_047901425.1" "GCF_004342245.1" "GCA_032062225.1"; do
      rm -f work_wgs_tree/gffs/${id}.gff
      rm -f work_wgs_tree/fastas/${id}.fna
      rm -rf work_wgs_tree/prokka/${id}
      rm -rf work_wgs_tree/genomes_ncbi/${id}
      # remove from accession list so it won't come back
      # use -v so the shell variable actually reaches awk ("${id}" inside single quotes would compare against the literal string)
      awk -F'\t' -v id="$id" 'NR==1 || $2!=id' work_wgs_tree/meta/accessions.tsv > work_wgs_tree/meta/accessions.tsv.tmp \
      && mv work_wgs_tree/meta/accessions.tsv.tmp work_wgs_tree/meta/accessions.tsv
    done
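
    # Verify the removal (second column lists the remaining accessions):
    cut -f2 work_wgs_tree/meta/accessions.tsv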

    ./build_wgs_tree_fig3B.sh
    #Wrote: work_wgs_tree/plot/labels.tsv
    #Error: package or namespace load failed for ‘ggtree’ in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]):
    #there is no package called ‘aplot’
    #Execution halted --> Using env r_env instead (see below)!

    # Run this to regenerate labels.tsv
    bash regenerate_labels.sh

    # Regenerate the plot --> ERROR --> Using Rscript instead (see below)!
    ENV_NAME=/home/jhuang/mambaforge/envs/r_env ./build_wgs_tree_fig3B.sh plot-only
    #-->Error in as.hclust.phylo(tr) : the tree is not ultrametric
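    #(An ML tree with substitution branch lengths is generally not ultrametric, which is exactly what as.hclust.phylo() rejects; plot_tree_v4.R presumably avoids that conversion.)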

    # Manually correct the display names in work_wgs_tree/plot/labels.tsv
    #sample display
    #GCF_000816495.1    Acinetobacter harbinensis HITLi7 (GCF_000816495.1)
    #GCF_965200015.1    Acinetobacter sp. ANC (GCF_965200015.1)
    #GCF_000621065.1    Acinetobacter tandoii DSM 14970 (GCF_000621065.1)
    #GCF_000368785.1    Acinetobacter towneri DSM 14962 (GCF_000368785.1)
    #GCF_000162115.1    Acinetobacter radioresistens SH164 (GCF_000162115.1)
    #GCF_000175675.1    Acinetobacter radioresistens SK82 (GCF_000175675.1)
    #GCF_000368905.1    Acinetobacter radioresistens DSM 6976 (GCF_000368905.1)
    #GCF_000413875.1    Acinetobacter indicus ANC (GCF_000413875.1)
    #GCF_000488255.1    Acinetobacter indicus CIP 110367 (GCF_000488255.1)
    #REF    AN6

    # Rerun only the plot step using plot_tree_v4.R (positional args: tree, labels.tsv, k — the 6 presumably mirrors CLUSTERS_K — output PDF, output PNG)
    Rscript ./plot_tree_v4.R \
      work_wgs_tree/raxmlng/core.raxml.support \
      work_wgs_tree/plot/labels.tsv \
      6 \
      work_wgs_tree/plot/core_tree.pdf \
      work_wgs_tree/plot/core_tree.png

8) ANI confirmation (fastANI loop)

    mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
    for id in GCF_000621065.1.fna GCF_000368785.1.fna GCF_000175675.1.fna GCF_000368905.1.fna GCF_000816495.1.fna GCF_965200015.1.fna GCF_000488255.1.fna GCF_000413875.1.fna GCF_000162115.1.fna; do
      fastANI -q AN6.fasta -r ./work_wgs_tree/fastas/${id} -o fastANI_AN6_vs_${id}.txt
    done
    # Alternatively, we can use the script run_fastani_batch_verbose.sh.
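
    # Collect the pairwise results into one ANI-ranked table (fastANI columns: query, reference, ANI%, mapped fragments, total fragments):
    cat fastANI_AN6_vs_*.txt | sort -t$'\t' -k3,3gr > fastANI_AN6_summary.tsv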

9) Contig-to-reference mapping (how many contigs map?)

In total, we obtained 41 contigs >500 nt. Of these, 36 contigs were scaffolded with Multi-CSAR v1.1 into three chromosomal scaffolds:

  • SCF_1: 1,773,912 bp
  • SCF_2: 1,197,749 bp
  • SCF_3: 23,925 bp

  Total: 2,995,586 bp

The remaining five contigs (contig00026/32/33/37/39) could not be scaffolded. Their partial BLASTn matches to both plasmid and chromosomal sequences suggest shared mobile elements, but do not confirm circular plasmids. A sequence/assembly summary was exported to Excel (Summary_AN6.xlsx), including read yield/read-length statistics and key assembly/QC metrics (genome size, contigs/scaffolds, N50, GC%, completeness, contamination).
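
A minimal sketch for re-checking one of those contigs (paths and the seqkit extraction are illustrative; assumes BLAST+ with remote access):

seqkit grep -p contig00026 shovill_out/contigs.fa > contig00026.fasta
blastn -query contig00026.fasta -db nt -remote \
  -outfmt '6 qseqid sseqid pident length evalue stitle' \
  -max_target_seqs 5 > contig00026.blastn.tsv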


Complete scripts (as attached)

Below are the full scripts exactly as provided, including plot_tree_v4.R.


make_table1_pe.sh

#!/usr/bin/env bash
set -Eeuo pipefail

# =========================
# User config
ENV_NAME="${ENV_NAME:-checkm_env2}"

# If you have Illumina paired-end, set R1/R2 (recommended)
R1="${R1:-}"
R2="${R2:-}"

# If you have single-end/ONT-like reads, set READS instead (legacy mode)
READS="${READS:-}"

ASM="${ASM:-shovill_out/contigs.fa}"
SAMPLE="${SAMPLE:-An6}"

THREADS="${THREADS:-32}"
OUT_TSV="${OUT_TSV:-Table1_${SAMPLE}.tsv}"

WORKDIR="${WORKDIR:-table1_${SAMPLE}_work}"
LOGDIR="${LOGDIR:-${WORKDIR}/logs}"
LOGFILE="${LOGFILE:-${LOGDIR}/run_$(date +%F_%H%M%S).log}"

AUTO_INSTALL="${AUTO_INSTALL:-1}"   # 1=install missing tools in ENV_NAME
GUNC_DB_KIND="${GUNC_DB_KIND:-progenomes}"  # progenomes or gtdb
# =========================

mkdir -p "${LOGDIR}"
exec > >(tee -a "${LOGFILE}") 2>&1

ts(){ date +"%F %T"; }
log(){ echo "[$(ts)] $*"; }

on_err() {
  local ec=$?
  log "ERROR: failed (exit=${ec}) at line ${BASH_LINENO[0]}: ${BASH_COMMAND}"
  log "Logfile: ${LOGFILE}"
  exit "${ec}"
}
trap on_err ERR

# print every command
set -x

need_cmd(){ command -v "$1" >/dev/null 2>&1; }

pick_pm() {
  if need_cmd mamba; then echo "mamba"
  elif need_cmd conda; then echo "conda"
  else
    log "ERROR: neither mamba nor conda found in PATH"
    exit 1
  fi
}

activate_env() {
  if ! need_cmd conda; then
    log "ERROR: conda not found; cannot activate env"
    exit 1
  fi
  # shellcheck disable=SC1091
  source "$(conda info --base)/etc/profile.d/conda.sh"
  conda activate "${ENV_NAME}"
}

ensure_env_exists() {
  # shellcheck disable=SC1091
  source "$(conda info --base)/etc/profile.d/conda.sh"
  if ! conda env list | awk '{print $1}' | grep -qx "${ENV_NAME}"; then
    log "ERROR: env ${ENV_NAME} not found. Create it first."
    exit 1
  fi
}

install_pkgs_in_env() {
  local pm="$1"; shift
  local pkgs=("$@")
  log "Installing into env ${ENV_NAME}: ${pkgs[*]}"
  "${pm}" install -n "${ENV_NAME}" -c bioconda -c conda-forge -y "${pkgs[@]}"
}

pick_quast_cmd() {
  if need_cmd quast; then echo "quast"
  elif need_cmd quast.py; then echo "quast.py"
  else echo ""
  fi
}

# tool->package mapping (install missing ones)
declare -A TOOL2PKG=(
  [quast]="quast"
  [minimap2]="minimap2"
  [samtools]="samtools"
  [mosdepth]="mosdepth"
  [checkm]="checkm-genome=1.1.3"
  [gunc]="gunc"
  [python]="python"
)

# =========================
# Detect mode (PE vs single)
MODE=""
if [[ -n "${R1}" || -n "${R2}" ]]; then
  [[ -n "${R1}" && -n "${R2}" ]] || { log "ERROR: Provide both R1 and R2."; exit 1; }
  MODE="PE"
elif [[ -n "${READS}" ]]; then
  MODE="SINGLE"
else
  log "ERROR: Provide either (R1+R2) OR READS."
  exit 1
fi

# =========================
# Start
log "Start: Table 1 generation (reuse env=${ENV_NAME})"
log "Assembly: ${ASM}"
log "Sample:   ${SAMPLE}"
log "Threads:  ${THREADS}"
log "Workdir:  ${WORKDIR}"
log "Logfile:  ${LOGFILE}"
log "Mode:     ${MODE}"
if [[ "${MODE}" == "PE" ]]; then
  log "R1:       ${R1}"
  log "R2:       ${R2}"
else
  log "Reads:    ${READS}"
fi

PM="$(pick_pm)"
log "Pkg manager: ${PM}"

ensure_env_exists
activate_env

log "Active envs:"
conda info --envs

log "Versions (if available):"
( python --version || true )
( checkm --version || true )
( gunc -v || true )
( minimap2 --version 2>&1 | head -n 2 || true )
( samtools --version 2>&1 | head -n 2 || true )
( mosdepth --version 2>&1 | head -n 2 || true )
( quast --version 2>&1 | head -n 2 || true )
( quast.py --version 2>&1 | head -n 2 || true )

# =========================
# Check/install missing tools in this env
MISSING_PKGS=()

for tool in minimap2 samtools mosdepth checkm gunc python; do
  if ! need_cmd "${tool}"; then
    MISSING_PKGS+=("${TOOL2PKG[$tool]}")
  fi
done

QUAST_CMD="$(pick_quast_cmd)"
if [[ -z "${QUAST_CMD}" ]]; then
  MISSING_PKGS+=("${TOOL2PKG[quast]}")
fi

if [[ "${#MISSING_PKGS[@]}" -gt 0 ]]; then
  if [[ "${AUTO_INSTALL}" != "1" ]]; then
    log "ERROR: missing tools and AUTO_INSTALL=0. Missing packages: ${MISSING_PKGS[*]}"
    exit 1
  fi
  mapfile -t UNIQUE < <(printf "%s\n" "${MISSING_PKGS[@]}" | awk '!seen[$0]++')
  install_pkgs_in_env "${PM}" "${UNIQUE[@]}"
  activate_env
  QUAST_CMD="$(pick_quast_cmd)"
fi

for tool in minimap2 samtools mosdepth checkm gunc python; do
  need_cmd "${tool}" || { log "ERROR: still missing tool: ${tool}"; exit 1; }
done
[[ -n "${QUAST_CMD}" ]] || { log "ERROR: QUAST still missing."; exit 1; }

log "All tools ready. QUAST cmd: ${QUAST_CMD}"

# =========================
# Prepare workdir
mkdir -p "${WORKDIR}"/{genomes,reads,stats,quast,map,checkm,gunc,tmp}

ASM_ABS="$(realpath "${ASM}")"
ln -sf "${ASM_ABS}" "${WORKDIR}/genomes/${SAMPLE}.fasta"

if [[ "${MODE}" == "PE" ]]; then
  R1_ABS="$(realpath "${R1}")"
  R2_ABS="$(realpath "${R2}")"
  ln -sf "${R1_ABS}" "${WORKDIR}/reads/${SAMPLE}.R1.fastq.gz"
  ln -sf "${R2_ABS}" "${WORKDIR}/reads/${SAMPLE}.R2.fastq.gz"
else
  READS_ABS="$(realpath "${READS}")"
  ln -sf "${READS_ABS}" "${WORKDIR}/reads/${SAMPLE}.reads.fastq.gz"
fi

# =========================
# 1) QUAST
log "Run QUAST..."
"${QUAST_CMD}" "${WORKDIR}/genomes/${SAMPLE}.fasta" -o "${WORKDIR}/quast"
QUAST_TSV="${WORKDIR}/quast/report.tsv"
test -s "${QUAST_TSV}"

# =========================
# 2) Map reads + mosdepth
log "Map reads (minimap2) + sort BAM..."
SORT_T="$((THREADS>16?16:THREADS))"

if [[ "${MODE}" == "PE" ]]; then
  minimap2 -t "${THREADS}" -ax sr \
    "${WORKDIR}/genomes/${SAMPLE}.fasta" \
    "${WORKDIR}/reads/${SAMPLE}.R1.fastq.gz" "${WORKDIR}/reads/${SAMPLE}.R2.fastq.gz" \
    | samtools sort -@ "${SORT_T}" -o "${WORKDIR}/map/${SAMPLE}.bam" -
else
  # legacy single-read mode; keep map-ont as in original script
  minimap2 -t "${THREADS}" -ax map-ont \
    "${WORKDIR}/genomes/${SAMPLE}.fasta" "${WORKDIR}/reads/${SAMPLE}.reads.fastq.gz" \
    | samtools sort -@ "${SORT_T}" -o "${WORKDIR}/map/${SAMPLE}.bam" -
fi

samtools index "${WORKDIR}/map/${SAMPLE}.bam"

log "Compute depth (mosdepth)..."
mosdepth -t "${SORT_T}" "${WORKDIR}/map/${SAMPLE}" "${WORKDIR}/map/${SAMPLE}.bam"
MOS_SUMMARY="${WORKDIR}/map/${SAMPLE}.mosdepth.summary.txt"
test -s "${MOS_SUMMARY}"

# =========================
# 3) CheckM
log "Run CheckM lineage_wf..."
checkm lineage_wf -x fasta -t "${THREADS}" "${WORKDIR}/genomes" "${WORKDIR}/checkm/out"

log "Run CheckM qa..."
checkm qa "${WORKDIR}/checkm/out/lineage.ms" "${WORKDIR}/checkm/out" --tab_table -o 2 \
  > "${WORKDIR}/checkm/checkm_summary.tsv"
CHECKM_SUM="${WORKDIR}/checkm/checkm_summary.tsv"
test -s "${CHECKM_SUM}"

# =========================
# 4) GUNC
log "Run GUNC..."
mkdir -p "${WORKDIR}/gunc/db" "${WORKDIR}/gunc/out"

if [[ -z "$(ls -A "${WORKDIR}/gunc/db" 2>/dev/null || true)" ]]; then
  log "Downloading GUNC DB kind=${GUNC_DB_KIND} to ${WORKDIR}/gunc/db ..."
  gunc download_db -db "${GUNC_DB_KIND}" "${WORKDIR}/gunc/db"
fi

DMND="$(find "${WORKDIR}/gunc/db" -type f -name "*.dmnd" | head -n 1 || true)"
if [[ -z "${DMND}" ]]; then
  log "ERROR: No *.dmnd found under ${WORKDIR}/gunc/db after download."
  ls -lah "${WORKDIR}/gunc/db" || true
  exit 1
fi
log "Using GUNC db_file: ${DMND}"

gunc run \
  --db_file "${DMND}" \
  --input_fasta "${WORKDIR}/genomes/${SAMPLE}.fasta" \
  --out_dir "${WORKDIR}/gunc/out" \
  --threads "${THREADS}" \
  --detailed_output \
  --contig_taxonomy_output \
  --use_species_level

ALL_LEVELS="$(find "${WORKDIR}/gunc/out" -name "*all_levels.tsv" | head -n 1 || true)"
test -n "${ALL_LEVELS}"
log "Found GUNC all_levels.tsv: ${ALL_LEVELS}"

# =========================
# 5) Parse outputs and write Table 1 TSV
log "Parse outputs → ${OUT_TSV}"
export SAMPLE WORKDIR OUT_TSV GUNC_ALL_LEVELS="${ALL_LEVELS}"

python - <<'PY'
import csv, os

sample = os.environ["SAMPLE"]
workdir = os.environ["WORKDIR"]
out_tsv = os.environ["OUT_TSV"]
gunc_all_levels = os.environ["GUNC_ALL_LEVELS"]

quast_tsv = os.path.join(workdir, "quast", "report.tsv")
mos_summary = os.path.join(workdir, "map", f"{sample}.mosdepth.summary.txt")
checkm_sum = os.path.join(workdir, "checkm", "checkm_summary.tsv")

def read_quast(path):
    with open(path, newline="") as f:
        rows = list(csv.reader(f, delimiter="\t"))
    asm_idx = 1
    d = {}
    for r in rows[1:]:
        if not r: continue
        key = r[0].strip()
        val = r[asm_idx].strip() if asm_idx < len(r) else ""
        d[key] = val
    return d

def read_mosdepth(path):
    with open(path) as f:
        for line in f:
            if line.startswith("chrom"): continue
            parts = line.rstrip("\n").split("\t")
            if len(parts) >= 4 and parts[0] == "total":
                return parts[3]
    return ""

def read_checkm(path, sample):
    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            bid = row.get("Bin Id") or row.get("Bin") or row.get("bin_id") or ""
            if bid == sample:
                return row
    return {}

def read_gunc_all_levels(path):
    coarse_lvls = {"kingdom","phylum","class"}
    fine_lvls = {"order","family","genus","species"}
    coarse, fine = [], []
    best_line = None
    rank = {"kingdom":0,"phylum":1,"class":2,"order":3,"family":4,"genus":5,"species":6}
    best_rank = -1

    with open(path, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            lvl = (row.get("taxonomic_level") or "").strip()
            p = row.get("proportion_genes_retained_in_major_clades") or ""
            try:
                pv = float(p)
            except:
                pv = None
            if pv is not None:
                if lvl in coarse_lvls: coarse.append(pv)
                if lvl in fine_lvls: fine.append(pv)
            if lvl in rank and rank[lvl] > best_rank:
                best_rank = rank[lvl]
                best_line = row

    coarse_mean = sum(coarse)/len(coarse) if coarse else ""
    fine_mean = sum(fine)/len(fine) if fine else ""
    contamination_portion = best_line.get("contamination_portion","") if best_line else ""
    pass_gunc = best_line.get("pass.GUNC","") if best_line else ""
    return coarse_mean, fine_mean, contamination_portion, pass_gunc

qu = read_quast(quast_tsv)
mean_depth = read_mosdepth(mos_summary)
ck = read_checkm(checkm_sum, sample)
coarse_mean, fine_mean, contamination_portion, pass_gunc = read_gunc_all_levels(gunc_all_levels)

header = [
    "Sample",
    "Genome_length_bp",
    "Contigs",
    "N50_bp",
    "L50",
    "GC_percent",
    "Mean_depth_x",
    "CheckM_completeness_percent",
    "CheckM_contamination_percent",
    "CheckM_strain_heterogeneity_percent",
    "GUNC_coarse_consistency",
    "GUNC_fine_consistency",
    "GUNC_contamination_portion",
    "GUNC_pass"
]

row = [
    sample,
    qu.get("Total length", ""),
    qu.get("# contigs", ""),
    qu.get("N50", ""),
    qu.get("L50", ""),
    qu.get("GC (%)", ""),
    mean_depth,
    ck.get("Completeness", ""),
    ck.get("Contamination", ""),
    ck.get("Strain heterogeneity", ""),
    f"{coarse_mean:.4f}" if isinstance(coarse_mean, float) else coarse_mean,
    f"{fine_mean:.4f}" if isinstance(fine_mean, float) else fine_mean,
    contamination_portion,
    pass_gunc
]

with open(out_tsv, "w", newline="") as f:
    w = csv.writer(f, delimiter="\t")
    w.writerow(header)
    w.writerow(row)

print(f"OK: wrote {out_tsv}")
PY

log "SUCCESS"
log "Output TSV: ${OUT_TSV}"
log "Workdir: ${WORKDIR}"
log "Logfile: ${LOGFILE}"

export_table1_stats_to_excel_py36_compat.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Export a comprehensive Excel workbook from a Table1 pipeline workdir.
Python 3.6 compatible (no PEP604 unions, no builtin generics).
Requires: openpyxl

Sheets (as available):
- Summary
- Table1 (if Table1_*.tsv exists)
- QUAST_report (report.tsv)
- QUAST_metrics (metric/value)
- Mosdepth_summary (*.mosdepth.summary.txt)
- CheckM (checkm_summary.tsv)
- GUNC_* (all .tsv under gunc/out)
- File_Inventory (relative path, size, mtime; optional md5 for small files)
- Run_log_preview (head/tail of latest log under workdir/logs or workdir/*/logs)
"""

from __future__ import print_function

import argparse
import csv
import hashlib
import os
import re
import sys
import time
from pathlib import Path

try:
    from openpyxl import Workbook
    from openpyxl.utils import get_column_letter
except ImportError:
    sys.stderr.write("ERROR: openpyxl is required. Install with:\n"
                     "  conda install -c conda-forge openpyxl\n")
    raise

MAX_XLSX_ROWS = 1048576

def safe_sheet_name(name, used):
    # Excel: <=31 chars, cannot contain: : \ / ? * [ ]
    bad = r'[:\\/?*\[\]]'
    base = name.strip() or "Sheet"
    base = re.sub(bad, "_", base)
    base = base[:31]
    if base not in used:
        used.add(base)
        return base
    # make unique with suffix
    for i in range(2, 1000):
        suffix = "_%d" % i
        cut = 31 - len(suffix)
        candidate = (base[:cut] + suffix)
        if candidate not in used:
            used.add(candidate)
            return candidate
    raise RuntimeError("Too many duplicate sheet names for base=%s" % base)

def autosize(ws, max_width=60):
    for col in ws.columns:
        max_len = 0
        col_letter = get_column_letter(col[0].column)
        for cell in col:
            v = cell.value
            if v is None:
                continue
            s = str(v)
            if len(s) > max_len:
                max_len = len(s)
        ws.column_dimensions[col_letter].width = min(max_width, max(10, max_len + 2))

def write_table(ws, header, rows, max_rows=None):
    if header:
        ws.append(header)
    count = 0
    for r in rows:
        ws.append(r)
        count += 1
        if max_rows is not None and count >= max_rows:
            break

def read_tsv(path, max_rows=None):
    header = []
    rows = []
    with path.open("r", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        for i, r in enumerate(reader):
            if i == 0:
                header = r
                continue
            rows.append(r)
            if max_rows is not None and len(rows) >= max_rows:
                break
    return header, rows

def read_text_table(path, max_rows=None):
    # for mosdepth summary (tsv with header)
    return read_tsv(path, max_rows=max_rows)

def md5_file(path, chunk=1024*1024):
    h = hashlib.md5()
    with path.open("rb") as f:
        while True:
            b = f.read(chunk)
            if not b:
                break
            h.update(b)
    return h.hexdigest()

def find_latest_log(workdir):
    candidates = []
    # common locations
    for p in [workdir / "logs", workdir / "log", workdir / "Logs"]:
        if p.exists():
            candidates.extend(p.glob("*.log"))
    # nested logs
    candidates.extend(workdir.glob("**/logs/*.log"))
    if not candidates:
        return None
    candidates.sort(key=lambda x: x.stat().st_mtime, reverse=True)
    return candidates[0]

def add_summary_sheet(wb, used, info_items):
    ws = wb.create_sheet(title=safe_sheet_name("Summary", used))
    ws.append(["Key", "Value"])
    for k, v in info_items:
        ws.append([k, v])
    autosize(ws)

def add_log_preview(wb, used, log_path, head_n=80, tail_n=120):
    if log_path is None or not log_path.exists():
        return
    ws = wb.create_sheet(title=safe_sheet_name("Run_log_preview", used))
    ws.append(["Log path", str(log_path)])
    ws.append([])
    lines = log_path.read_text(errors="replace").splitlines()
    ws.append(["--- HEAD (%d) ---" % head_n])
    for line in lines[:head_n]:
        ws.append([line])
    ws.append([])
    ws.append(["--- TAIL (%d) ---" % tail_n])
    for line in lines[-tail_n:]:
        ws.append([line])
    ws.column_dimensions["A"].width = 120

def add_file_inventory(wb, used, workdir, do_md5=True, md5_max_bytes=200*1024*1024, max_rows=None):
    ws = wb.create_sheet(title=safe_sheet_name("File_Inventory", used))
    ws.append(["relative_path", "size_bytes", "mtime_iso", "md5(optional)"])
    count = 0
    for p in sorted(workdir.rglob("*")):
        if p.is_dir():
            continue
        rel = str(p.relative_to(workdir))
        st = p.stat()
        mtime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime))
        md5 = ""
        if do_md5 and st.st_size <= md5_max_bytes:
            try:
                md5 = md5_file(p)
            except Exception:
                md5 = "ERROR"
        ws.append([rel, st.st_size, mtime, md5])
        count += 1
        if max_rows is not None and count >= max_rows:
            break
    autosize(ws, max_width=80)

def add_tsv_sheet(wb, used, name, path, max_rows=None):
    header, rows = read_tsv(path, max_rows=max_rows)
    ws = wb.create_sheet(title=safe_sheet_name(name, used))
    write_table(ws, header, rows, max_rows=max_rows)
    autosize(ws, max_width=80)

def add_quast_metrics_sheet(wb, used, quast_report_tsv):
    header, rows = read_tsv(quast_report_tsv, max_rows=None)
    if not header or len(header) < 2:
        return
    asm_name = header[1]
    ws = wb.create_sheet(title=safe_sheet_name("QUAST_metrics", used))
    ws.append(["Metric", asm_name])
    for r in rows:
        if not r:
            continue
        metric = r[0]
        val = r[1] if len(r) > 1 else ""
        ws.append([metric, val])
    autosize(ws, max_width=80)

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--workdir", required=True, help="workdir produced by pipeline (e.g., table1_GE11174_work)")
    ap.add_argument("--out", required=True, help="output .xlsx")
    ap.add_argument("--sample", default="", help="sample name for summary")
    ap.add_argument("--max-rows", type=int, default=200000, help="max rows per large sheet")
    ap.add_argument("--no-md5", action="store_true", help="skip md5 calculation in File_Inventory")
    args = ap.parse_args()

    workdir = Path(args.workdir).resolve()
    out = Path(args.out).resolve()

    if not workdir.exists():
        sys.stderr.write("ERROR: workdir not found: %s\n" % workdir)
        sys.exit(2)

    wb = Workbook()
    # remove default sheet
    wb.remove(wb.active)
    used = set()

    # Summary info
    info = [
        ("sample", args.sample or ""),
        ("workdir", str(workdir)),
        ("generated_at", time.strftime("%Y-%m-%d %H:%M:%S")),
        ("python", sys.version.replace("\n", " ")),
        ("openpyxl", __import__("openpyxl").__version__),
    ]
    add_summary_sheet(wb, used, info)

    # Table1 TSV (try common names)
    table1_candidates = list(workdir.glob("Table1_*.tsv")) + list(workdir.glob("*.tsv"))
    # Prefer Table1_*.tsv in workdir root
    table1_path = None
    for p in table1_candidates:
        if p.name.startswith("Table1_") and p.suffix == ".tsv":
            table1_path = p
            break
    if table1_path is None:
        # maybe created in cwd, not inside workdir; try alongside workdir
        parent = workdir.parent
        for p in parent.glob("Table1_*.tsv"):
            if args.sample and args.sample in p.name:
                table1_path = p
                break
        if table1_path is None and list(parent.glob("Table1_*.tsv")):
            table1_path = sorted(parent.glob("Table1_*.tsv"))[0]

    if table1_path is not None and table1_path.exists():
        add_tsv_sheet(wb, used, "Table1", table1_path, max_rows=args.max_rows)

    # QUAST
    quast_report = workdir / "quast" / "report.tsv"
    if quast_report.exists():
        add_tsv_sheet(wb, used, "QUAST_report", quast_report, max_rows=args.max_rows)
        add_quast_metrics_sheet(wb, used, quast_report)

    # Mosdepth summary
    for p in sorted((workdir / "map").glob("*.mosdepth.summary.txt")):
        # mosdepth summary is TSV-like
        name = "Mosdepth_" + p.stem.replace(".mosdepth.summary", "")
        add_tsv_sheet(wb, used, name[:31], p, max_rows=args.max_rows)

    # CheckM
    checkm_sum = workdir / "checkm" / "checkm_summary.tsv"
    if checkm_sum.exists():
        add_tsv_sheet(wb, used, "CheckM", checkm_sum, max_rows=args.max_rows)

    # GUNC outputs (all TSV under gunc/out)
    gunc_out = workdir / "gunc" / "out"
    if gunc_out.exists():
        for p in sorted(gunc_out.rglob("*.tsv")):
            rel = str(p.relative_to(gunc_out))
            sheet = "GUNC_" + rel.replace("/", "_").replace("\\", "_").replace(".tsv", "")
            add_tsv_sheet(wb, used, sheet[:31], p, max_rows=args.max_rows)

    # Log preview
    latest_log = find_latest_log(workdir)
    add_log_preview(wb, used, latest_log)

    # File inventory
    add_file_inventory(
        wb, used, workdir,
        do_md5=(not args.no_md5),
        md5_max_bytes=200*1024*1024,
        max_rows=args.max_rows
    )

    # Save
    out.parent.mkdir(parents=True, exist_ok=True)
    wb.save(str(out))
    print("OK: wrote %s" % out)

if __name__ == "__main__":
    main()

run_resistome_virulome_dedup.sh

#!/usr/bin/env bash
set -Eeuo pipefail

# -------- user inputs --------
ENV_NAME="${ENV_NAME:-bengal3_ac3}"
ASM="${ASM:-GE11174.fasta}"
SAMPLE="${SAMPLE:-GE11174}"
OUTDIR="${OUTDIR:-resistome_virulence_${SAMPLE}}"
THREADS="${THREADS:-16}"

# thresholds (set to 0/0 if you truly want ABRicate defaults)
MINID="${MINID:-90}"
MINCOV="${MINCOV:-60}"
# ----------------------------

log(){ echo "[$(date +'%F %T')] $*" >&2; }
need_cmd(){ command -v "$1" >/dev/null 2>&1; }

activate_env() {
  # shellcheck disable=SC1091
  source "$(conda info --base)/etc/profile.d/conda.sh"
  conda activate "${ENV_NAME}"
}

main(){
  activate_env

  mkdir -p "${OUTDIR}"/{raw,amr,virulence,card,tmp}

  log "Env:    ${ENV_NAME}"
  log "ASM:    ${ASM}"
  log "Sample: ${SAMPLE}"
  log "Outdir: ${OUTDIR}"
  log "ABRicate thresholds: MINID=${MINID} MINCOV=${MINCOV}"

  log "ABRicate DB list:"
  abricate --list | egrep -i "vfdb|resfinder|megares|card" || true

  # Make sure indices exist
  log "Running abricate --setupdb (safe even if already done)..."
  abricate --setupdb

  # ---- ABRicate AMR DBs ----
  log "Running ABRicate: ResFinder"
  abricate --db resfinder --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.resfinder.tab"

  log "Running ABRicate: MEGARes"
  abricate --db megares   --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.megares.tab"

  # ---- Virulence (VFDB) ----
  log "Running ABRicate: VFDB"
  abricate --db vfdb      --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.vfdb.tab"

  # ---- CARD: prefer RGI if available, else ABRicate card ----
  CARD_MODE="ABRicate"
  if need_cmd rgi; then
    log "RGI found. Trying RGI (CARD) ..."
    set +e
    rgi main --input_sequence "${ASM}" --output_file "${OUTDIR}/card/${SAMPLE}.rgi" --input_type contig --num_threads "${THREADS}"
    rc=$?
    set -e
    if [[ $rc -eq 0 ]]; then
      CARD_MODE="RGI"
    else
      log "RGI failed (likely CARD data not installed). Falling back to ABRicate card."
    fi
  fi

  if [[ "${CARD_MODE}" == "ABRicate" ]]; then
    log "Running ABRicate: CARD"
    abricate --db card --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.card.tab"
  fi

  # ---- Build deduplicated tables ----
  log "Creating deduplicated AMR/VFDB tables..."

  export OUTDIR SAMPLE CARD_MODE
  python - <<'PY'
import os, re
from pathlib import Path
import pandas as pd
from io import StringIO

outdir = Path(os.environ["OUTDIR"])
sample = os.environ["SAMPLE"]
card_mode = os.environ["CARD_MODE"]

def read_abricate_tab(path: Path, source: str) -> pd.DataFrame:
    if not path.exists() or path.stat().st_size == 0:
        return pd.DataFrame()
    lines=[]
    with path.open("r", errors="replace") as f:
        for line in f:
            if not line.strip():
                continue
            if line.startswith("#FILE"):
                # keep the column header row (strip the leading '#') so read_csv gets real column names
                lines.append(line.lstrip("#"))
                continue
            if line.startswith("#"):
                continue
            lines.append(line)
    if not lines:
        return pd.DataFrame()
    df = pd.read_csv(StringIO("".join(lines)), sep="\t", dtype=str)
    df.insert(0, "Source", source)
    return df

def to_num(s):
    try:
        return float(str(s).replace("%",""))
    except:
        return None

def normalize_abricate(df: pd.DataFrame, dbname: str) -> pd.DataFrame:
    if df.empty:
        return pd.DataFrame(columns=[
            "Source","Database","Gene","Product","Accession","Contig","Start","End","Strand","Pct_Identity","Pct_Coverage"
        ])
    # Column names vary slightly; handle common ones
    gene = "GENE" if "GENE" in df.columns else None
    prod = "PRODUCT" if "PRODUCT" in df.columns else None
    acc  = "ACCESSION" if "ACCESSION" in df.columns else None
    contig = "SEQUENCE" if "SEQUENCE" in df.columns else ("CONTIG" if "CONTIG" in df.columns else None)
    start = "START" if "START" in df.columns else None
    end   = "END" if "END" in df.columns else None
    strand= "STRAND" if "STRAND" in df.columns else None

    pid = "%IDENTITY" if "%IDENTITY" in df.columns else ("% Identity" if "% Identity" in df.columns else None)
    pcv = "%COVERAGE" if "%COVERAGE" in df.columns else ("% Coverage" if "% Coverage" in df.columns else None)

    out = pd.DataFrame()
    out["Source"] = df["Source"]
    out["Database"] = dbname
    out["Gene"] = df[gene] if gene else ""
    out["Product"] = df[prod] if prod else ""
    out["Accession"] = df[acc] if acc else ""
    out["Contig"] = df[contig] if contig else ""
    out["Start"] = df[start] if start else ""
    out["End"] = df[end] if end else ""
    out["Strand"] = df[strand] if strand else ""
    out["Pct_Identity"] = df[pid] if pid else ""
    out["Pct_Coverage"] = df[pcv] if pcv else ""
    return out

def dedup_best(df: pd.DataFrame, key_cols):
    """Keep best hit per key by highest identity, then coverage, then longest span."""
    if df.empty:
        return df
    # numeric helpers
    df = df.copy()
    df["_pid"] = df["Pct_Identity"].map(to_num)
    df["_pcv"] = df["Pct_Coverage"].map(to_num)

    def span(row):
        try:
            return abs(int(row["End"]) - int(row["Start"])) + 1
        except:
            return 0
    df["_span"] = df.apply(span, axis=1)

    # sort best-first
    df = df.sort_values(by=["_pid","_pcv","_span"], ascending=[False,False,False], na_position="last")
    df = df.drop_duplicates(subset=key_cols, keep="first")
    df = df.drop(columns=["_pid","_pcv","_span"])
    return df

# ---------- AMR inputs ----------
amr_frames = []

# ResFinder (often 0 hits; still okay)
resfinder = outdir / "raw" / f"{sample}.resfinder.tab"
df = read_abricate_tab(resfinder, "ABRicate")
amr_frames.append(normalize_abricate(df, "ResFinder"))

# MEGARes
megares = outdir / "raw" / f"{sample}.megares.tab"
df = read_abricate_tab(megares, "ABRicate")
amr_frames.append(normalize_abricate(df, "MEGARes"))

# CARD: RGI or ABRicate
if card_mode == "RGI":
    # Try common RGI tab outputs
    prefix = outdir / "card" / f"{sample}.rgi"
    rgi_tab = None
    for ext in [".txt",".tab",".tsv"]:
        p = Path(str(prefix) + ext)
        if p.exists() and p.stat().st_size > 0:
            rgi_tab = p
            break
    if rgi_tab is not None:
        rgi = pd.read_csv(rgi_tab, sep="\t", dtype=str)
        # build on rgi's index so the scalar columns broadcast to one row per hit
        # (assigning a scalar to an empty DataFrame first would leave every column empty)
        out = pd.DataFrame(index=rgi.index)
        out["Source"] = "RGI"
        out["Database"] = "CARD"
        # Prefer ARO_name/Best_Hit_ARO if present
        out["Gene"] = rgi["ARO_name"] if "ARO_name" in rgi.columns else (rgi["Best_Hit_ARO"] if "Best_Hit_ARO" in rgi.columns else "")
        out["Product"] = rgi["ARO_name"] if "ARO_name" in rgi.columns else ""
        out["Accession"] = rgi["ARO_accession"] if "ARO_accession" in rgi.columns else ""
        out["Contig"] = rgi["Sequence"] if "Sequence" in rgi.columns else ""
        out["Start"] = rgi["Start"] if "Start" in rgi.columns else ""
        out["End"] = rgi["Stop"] if "Stop" in rgi.columns else (rgi["End"] if "End" in rgi.columns else "")
        out["Strand"] = rgi["Orientation"] if "Orientation" in rgi.columns else ""
        out["Pct_Identity"] = rgi["% Identity"] if "% Identity" in rgi.columns else ""
        out["Pct_Coverage"] = rgi["% Coverage"] if "% Coverage" in rgi.columns else ""
        amr_frames.append(out)
else:
    card = outdir / "raw" / f"{sample}.card.tab"
    df = read_abricate_tab(card, "ABRicate")
    amr_frames.append(normalize_abricate(df, "CARD"))

amr_all = pd.concat([x for x in amr_frames if not x.empty], ignore_index=True) if any(not x.empty for x in amr_frames) else pd.DataFrame(
    columns=["Source","Database","Gene","Product","Accession","Contig","Start","End","Strand","Pct_Identity","Pct_Coverage"]
)

# Deduplicate within each (Database,Gene) – this is usually what you want for manuscript tables
amr_dedup = dedup_best(amr_all, key_cols=["Database","Gene"])

# Sort nicely
if not amr_dedup.empty:
    amr_dedup = amr_dedup.sort_values(["Database","Gene"]).reset_index(drop=True)

amr_out = outdir / "Table_AMR_genes_dedup.tsv"
amr_dedup.to_csv(amr_out, sep="\t", index=False)

# ---------- Virulence (VFDB) ----------
vfdb = outdir / "raw" / f"{sample}.vfdb.tab"
vf = read_abricate_tab(vfdb, "ABRicate")
vf_norm = normalize_abricate(vf, "VFDB")

# Dedup within (Gene) for VFDB (or use Database,Gene; Database constant)
vf_dedup = dedup_best(vf_norm, key_cols=["Gene"]) if not vf_norm.empty else vf_norm
if not vf_dedup.empty:
    vf_dedup = vf_dedup.sort_values(["Gene"]).reset_index(drop=True)

vf_out = outdir / "Table_Virulence_VFDB_dedup.tsv"
vf_dedup.to_csv(vf_out, sep="\t", index=False)

print("OK wrote:")
print(" ", amr_out)
print(" ", vf_out)
PY

  log "Done."
  log "Outputs:"
  log "  ${OUTDIR}/Table_AMR_genes_dedup.tsv"
  log "  ${OUTDIR}/Table_Virulence_VFDB_dedup.tsv"
  log "Raw:"
  log "  ${OUTDIR}/raw/${SAMPLE}.*.tab"
}

main

run_abricate_resistome_virulome_one_per_gene.sh

#!/usr/bin/env bash
set -Eeuo pipefail

# ------------------- USER SETTINGS -------------------
ENV_NAME="${ENV_NAME:-bengal3_ac3}"

ASM="${ASM:-GE11174.fasta}"          # input assembly fasta
SAMPLE="${SAMPLE:-GE11174}"

OUTDIR="${OUTDIR:-resistome_virulence_${SAMPLE}}"
THREADS="${THREADS:-16}"

# ABRicate thresholds
# If you want your earlier "35 genes" behavior, use MINID=70 MINCOV=50.
# If you want stricter: e.g. MINID=80 MINCOV=70.
MINID="${MINID:-70}"
MINCOV="${MINCOV:-50}"
# -----------------------------------------------------

ts(){ date +"%F %T"; }
log(){ echo "[$(ts)] $*" >&2; }

on_err(){
  local ec=$?
  log "ERROR: failed (exit=${ec}) at line ${BASH_LINENO[0]}: ${BASH_COMMAND}"
  exit $ec
}
trap on_err ERR

need_cmd(){ command -v "$1" >/dev/null 2>&1; }

activate_env() {
  # shellcheck disable=SC1091
  source "$(conda info --base)/etc/profile.d/conda.sh"
  conda activate "${ENV_NAME}"
}

main(){
  activate_env

  log "Env: ${ENV_NAME}"
  log "ASM: ${ASM}"
  log "Sample: ${SAMPLE}"
  log "Outdir: ${OUTDIR}"
  log "Threads: ${THREADS}"
  log "ABRicate thresholds: MINID=${MINID} MINCOV=${MINCOV}"

  mkdir -p "${OUTDIR}"/{raw,logs}

  # Save full log
  LOGFILE="${OUTDIR}/logs/run_$(date +'%F_%H%M%S').log"
  exec > >(tee -a "${LOGFILE}") 2>&1

  log "Tool versions:"
  abricate --version || true
  abricate-get_db --help | head -n 5 || true

  log "ABRicate DB list (selected):"
  abricate --list | egrep -i "vfdb|resfinder|megares|card" || true

  log "Indexing ABRicate databases (safe to re-run)..."
  abricate --setupdb

  # ---------------- Run ABRicate ----------------
  log "Running ABRicate: MEGARes"
  abricate --db megares   --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.megares.tab"

  log "Running ABRicate: CARD"
  abricate --db card      --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.card.tab"

  log "Running ABRicate: ResFinder"
  abricate --db resfinder --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.resfinder.tab"

  log "Running ABRicate: VFDB"
  abricate --db vfdb      --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.vfdb.tab"

  # --------------- Build tables -----------------
  export OUTDIR SAMPLE
  export MEGARES_TAB="${OUTDIR}/raw/${SAMPLE}.megares.tab"
  export CARD_TAB="${OUTDIR}/raw/${SAMPLE}.card.tab"
  export RESFINDER_TAB="${OUTDIR}/raw/${SAMPLE}.resfinder.tab"
  export VFDB_TAB="${OUTDIR}/raw/${SAMPLE}.vfdb.tab"

  export AMR_OUT="${OUTDIR}/Table_AMR_genes_one_per_gene.tsv"
  export VIR_OUT="${OUTDIR}/Table_Virulence_VFDB_dedup.tsv"
  export STATUS_OUT="${OUTDIR}/Table_DB_hit_counts.tsv"

  log "Generating deduplicated tables..."
  python - <<'PY'
import os
import pandas as pd
from pathlib import Path

megares_tab   = Path(os.environ["MEGARES_TAB"])
card_tab      = Path(os.environ["CARD_TAB"])
resfinder_tab = Path(os.environ["RESFINDER_TAB"])
vfdb_tab      = Path(os.environ["VFDB_TAB"])

amr_out    = Path(os.environ["AMR_OUT"])
vir_out    = Path(os.environ["VIR_OUT"])
status_out = Path(os.environ["STATUS_OUT"])

def read_abricate(path: Path) -> pd.DataFrame:
    """Parse ABRicate .tab where header line starts with '#FILE'."""
    if (not path.exists()) or path.stat().st_size == 0:
        return pd.DataFrame()
    header = None
    rows = []
    with path.open("r", errors="replace") as f:
        for line in f:
            if not line.strip():
                continue
            if line.startswith("#FILE"):
                header = line.lstrip("#").rstrip("\n").split("\t")
                continue
            if line.startswith("#"):
                continue
            rows.append(line.rstrip("\n").split("\t"))
    if header is None:
        return pd.DataFrame()
    if not rows:
        return pd.DataFrame(columns=header)
    return pd.DataFrame(rows, columns=header)

def normalize(df: pd.DataFrame, dbname: str) -> pd.DataFrame:
    cols_out = ["Database","Gene","Product","Accession","Contig","Start","End","Strand","Pct_Identity","Pct_Coverage"]
    if df is None or df.empty:
        return pd.DataFrame(columns=cols_out)
    out = pd.DataFrame({
        "Database": dbname,
        "Gene": df.get("GENE",""),
        "Product": df.get("PRODUCT",""),
        "Accession": df.get("ACCESSION",""),
        "Contig": df.get("SEQUENCE",""),
        "Start": df.get("START",""),
        "End": df.get("END",""),
        "Strand": df.get("STRAND",""),
        "Pct_Identity": pd.to_numeric(df.get("%IDENTITY",""), errors="coerce"),
        "Pct_Coverage": pd.to_numeric(df.get("%COVERAGE",""), errors="coerce"),
    })
    return out[cols_out]

def best_hit_dedup(df: pd.DataFrame, key_cols):
    """Keep best hit by highest identity, then coverage, then alignment length."""
    if df.empty:
        return df
    d = df.copy()
    d["Start_i"] = pd.to_numeric(d["Start"], errors="coerce").fillna(0).astype(int)
    d["End_i"]   = pd.to_numeric(d["End"], errors="coerce").fillna(0).astype(int)
    d["Len"]     = (d["End_i"] - d["Start_i"]).abs() + 1
    d = d.sort_values(["Pct_Identity","Pct_Coverage","Len"], ascending=[False,False,False])
    d = d.drop_duplicates(subset=key_cols, keep="first")
    return d.drop(columns=["Start_i","End_i","Len"])

def count_hits(path: Path) -> int:
    if not path.exists():
        return 0
    n = 0
    with path.open() as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            n += 1
    return n

# -------- load + normalize --------
parts = []
for dbname, p in [("MEGARes", megares_tab), ("CARD", card_tab), ("ResFinder", resfinder_tab)]:
    df = read_abricate(p)
    parts.append(normalize(df, dbname))

amr_all = pd.concat([x for x in parts if not x.empty], ignore_index=True) if any(not x.empty for x in parts) else pd.DataFrame(
    columns=["Database","Gene","Product","Accession","Contig","Start","End","Strand","Pct_Identity","Pct_Coverage"]
)

# remove empty genes
amr_all = amr_all[amr_all["Gene"].astype(str).str.len() > 0].copy()

# best per (Database,Gene)
amr_db_gene = best_hit_dedup(amr_all, ["Database","Gene"]) if not amr_all.empty else amr_all

# one row per Gene overall, priority: CARD > ResFinder > MEGARes
priority = {"CARD": 0, "ResFinder": 1, "MEGARes": 2}
if not amr_db_gene.empty:
    amr_db_gene["prio"] = amr_db_gene["Database"].map(priority).fillna(9).astype(int)
    amr_one = amr_db_gene.sort_values(
        ["Gene","prio","Pct_Identity","Pct_Coverage"],
        ascending=[True, True, False, False]
    )
    amr_one = amr_one.drop_duplicates(["Gene"], keep="first").drop(columns=["prio"])
    amr_one = amr_one.sort_values(["Gene"]).reset_index(drop=True)
else:
    amr_one = amr_db_gene

amr_out.parent.mkdir(parents=True, exist_ok=True)
amr_one.to_csv(amr_out, sep="\t", index=False)

# -------- VFDB --------
vf = normalize(read_abricate(vfdb_tab), "VFDB")
vf = vf[vf["Gene"].astype(str).str.len() > 0].copy()
vf_one = best_hit_dedup(vf, ["Gene"]) if not vf.empty else vf
if not vf_one.empty:
    vf_one = vf_one.sort_values(["Gene"]).reset_index(drop=True)

vir_out.parent.mkdir(parents=True, exist_ok=True)
vf_one.to_csv(vir_out, sep="\t", index=False)

# -------- status counts --------
status = pd.DataFrame([
    {"Database":"MEGARes",   "Hit_lines": count_hits(megares_tab),   "File": str(megares_tab)},
    {"Database":"CARD",      "Hit_lines": count_hits(card_tab),      "File": str(card_tab)},
    {"Database":"ResFinder", "Hit_lines": count_hits(resfinder_tab), "File": str(resfinder_tab)},
    {"Database":"VFDB",      "Hit_lines": count_hits(vfdb_tab),      "File": str(vfdb_tab)},
])
status_out.parent.mkdir(parents=True, exist_ok=True)
status.to_csv(status_out, sep="\t", index=False)

print("OK wrote:")
print(" ", amr_out, "rows=", len(amr_one))
print(" ", vir_out, "rows=", len(vf_one))
print(" ", status_out)
PY

  log "Finished."
  log "Main outputs:"
  log "  ${AMR_OUT}"
  log "  ${VIR_OUT}"
  log "  ${STATUS_OUT}"
  log "Raw ABRicate outputs:"
  log "  ${OUTDIR}/raw/${SAMPLE}.megares.tab"
  log "  ${OUTDIR}/raw/${SAMPLE}.card.tab"
  log "  ${OUTDIR}/raw/${SAMPLE}.resfinder.tab"
  log "  ${OUTDIR}/raw/${SAMPLE}.vfdb.tab"
  log "Log:"
  log "  ${LOGFILE}"
}

main

resolve_best_assemblies_entrez.py

#!/usr/bin/env python3
import csv
import os
import re
import sys
import time
from dataclasses import dataclass
from typing import List, Optional, Tuple

from Bio import Entrez

# REQUIRED by NCBI policy
Entrez.email = os.environ.get("NCBI_EMAIL", "your.email@example.com")

# Be nice to NCBI
ENTREZ_DELAY_SEC = float(os.environ.get("ENTREZ_DELAY_SEC", "0.34"))

LEVEL_RANK = {
    "Complete Genome": 0,
    "Chromosome": 1,
    "Scaffold": 2,
    "Contig": 3,
    # sometimes NCBI uses slightly different strings:
    "complete genome": 0,
    "chromosome": 1,
    "scaffold": 2,
    "contig": 3,
}

def level_rank(level: str) -> int:
    return LEVEL_RANK.get(level.strip(), 99)

def is_refseq(accession: str) -> bool:
    return accession.startswith("GCF_")

@dataclass
class AssemblyHit:
    assembly_uid: str
    assembly_accession: str   # GCF_... or GCA_...
    organism: str
    strain: str
    assembly_level: str
    refseq_category: str
    submitter: str
    ftp_path: str

def entrez_search_assembly(term: str, retmax: int = 50) -> List[str]:
    """Return Assembly UIDs matching term."""
    h = Entrez.esearch(db="assembly", term=term, retmax=str(retmax))
    rec = Entrez.read(h)
    h.close()
    time.sleep(ENTREZ_DELAY_SEC)
    return rec.get("IdList", [])

def entrez_esummary_assembly(uids: List[str]) -> List[AssemblyHit]:
    """Fetch assembly summary records for given UIDs."""
    if not uids:
        return []
    h = Entrez.esummary(db="assembly", id=",".join(uids), report="full")
    rec = Entrez.read(h)
    h.close()
    time.sleep(ENTREZ_DELAY_SEC)

    hits: List[AssemblyHit] = []
    docs = rec.get("DocumentSummarySet", {}).get("DocumentSummary", [])
    for d in docs:
        # Some fields can be missing
        acc = str(d.get("AssemblyAccession", "")).strip()
        org = str(d.get("Organism", "")).strip()
        level = str(d.get("AssemblyStatus", "")).strip() or str(d.get("AssemblyLevel", "")).strip()
        # NCBI uses "AssemblyStatus" sometimes, "AssemblyLevel" other times;
        # in practice AssemblyStatus often equals "Complete Genome"/"Chromosome"/...
        if not level:
            level = str(d.get("AssemblyLevel", "")).strip()

        strain = str(d.get("Biosample", "")).strip()
        # Strain is not always in a clean field. Try "Sub_value" in Meta, or parse Submitter/Title.
        # We'll try a few common places:
        title = str(d.get("AssemblyName", "")).strip()
        submitter = str(d.get("SubmitterOrganization", "")).strip()
        refcat = str(d.get("RefSeq_category", "")).strip()
        ftp = str(d.get("FtpPath_RefSeq", "")).strip() or str(d.get("FtpPath_GenBank", "")).strip()

        hits.append(
            AssemblyHit(
                assembly_uid=str(d.get("Uid", "")),
                assembly_accession=acc,
                organism=org,
                strain=strain,
                assembly_level=level,
                refseq_category=refcat,
                submitter=submitter,
                ftp_path=ftp,
            )
        )
    return hits

def best_hit(hits: List[AssemblyHit]) -> Optional[AssemblyHit]:
    """Pick best hit by level (Complete>Chromosome>...), prefer RefSeq, then prefer representative/reference."""
    if not hits:
        return None

    def key(h: AssemblyHit) -> Tuple[int, int, int, str]:
        # lower is better
        lvl = level_rank(h.assembly_level)
        ref = 0 if is_refseq(h.assembly_accession) else 1

        # prefer reference/representative if present
        cat = (h.refseq_category or "").lower()
        rep = 0
        if "reference" in cat:
            rep = 0
        elif "representative" in cat:
            rep = 1
        else:
            rep = 2

        # tie-breaker: accession string (stable)
        return (lvl, ref, rep, h.assembly_accession)

    return sorted(hits, key=key)[0]

def relaxed_fallback_terms(organism: str, strain_tokens: List[str]) -> List[str]:
    """
    Build fallback search terms:
      1) organism + strain tokens
      2) organism only (species-only)
      3) genus-only (if species fails)
    """
    terms = []
    # 1) Full term: organism + strain tokens
    if strain_tokens:
        t = f'"{organism}"[Organism] AND (' + " OR ".join(f'"{s}"[All Fields]' for s in strain_tokens) + ")"
        terms.append(t)

    # 2) Species only
    terms.append(f'"{organism}"[Organism]')

    # 3) Genus only
    genus = organism.split()[0]
    terms.append(f'"{genus}"[Organism]')

    return terms

def resolve_one(label: str, organism: str, strain_tokens: List[str], retmax: int = 80) -> Tuple[str, Optional[AssemblyHit], str]:
    """
    Returns:
      - selected accession or "NA"
      - selected hit (optional)
      - which query term matched
    """
    for term in relaxed_fallback_terms(organism, strain_tokens):
        uids = entrez_search_assembly(term, retmax=retmax)
        hits = entrez_esummary_assembly(uids)
        chosen = best_hit(hits)
        if chosen and chosen.assembly_accession:
            return chosen.assembly_accession, chosen, term
    return "NA", None, ""

def parse_targets_tsv(path: str) -> List[Tuple[str, str, List[str]]]:
    """
    Input TSV format:
      label  organism  strain_tokens
    where strain_tokens is a semicolon-separated list, e.g. "FRB97;FRB 97"
    """
    rows = []
    with open(path, newline="") as f:
        r = csv.DictReader(f, delimiter="\t")
        for row in r:
            label = row["label"].strip()
            org = row["organism"].strip()
            tokens = [x.strip() for x in row.get("strain_tokens", "").split(";") if x.strip()]
            rows.append((label, org, tokens))
    return rows

def main():
    if len(sys.argv) < 3:
        print("Usage: resolve_best_assemblies_entrez.py targets.tsv out.tsv", file=sys.stderr)
        sys.exit(2)

    targets_tsv = sys.argv[1]
    out_tsv = sys.argv[2]

    targets = parse_targets_tsv(targets_tsv)

    with open(out_tsv, "w", newline="") as f:
        w = csv.writer(f, delimiter="\t")
        w.writerow(["label", "best_accession", "assembly_level", "refseq_category", "organism", "query_used"])
        for label, org, tokens in targets:
            acc, hit, term = resolve_one(label, org, tokens)
            if hit:
                w.writerow([label, acc, hit.assembly_level, hit.refseq_category, hit.organism, term])
                print(f"[OK] {label} -> {acc} ({hit.assembly_level})")
            else:
                w.writerow([label, "NA", "", "", org, ""])
                print(f"[WARN] {label} -> NA (no assemblies found)")

if __name__ == "__main__":
    main()

build_wgs_tree_fig3B.sh

#!/usr/bin/env bash
set -euo pipefail

###############################################################################
# Core-genome phylogeny pipeline (genome-wide; no 16S/MLST):
#
# Uses existing conda env prefix:
#   ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3
#
# Inputs:
#   - resolved_accessions.tsv
#   - REF.fasta
#
# Also consider these 4 accessions (duplicates removed):
#   GCF_002291425.1, GCF_047901425.1, GCF_004342245.1, GCA_032062225.1
#
# Robustness:
#   - Conda activation hook may reference JAVA_HOME under set -u (handled)
#   - GFF validation ignores the ##FASTA FASTA block (valid GFF3)
#   - FIXED: No more double Roary directories (script no longer pre-creates -f dir)
#            Logs go to WORKDIR/logs and are also copied into the final Roary dir.
#
# Outputs:
#   ${WORKDIR}/plot/core_tree.pdf
#   ${WORKDIR}/plot/core_tree.png
###############################################################################

THREADS="${THREADS:-8}"
WORKDIR="${WORKDIR:-work_wgs_tree}"

RESOLVED_TSV="${RESOLVED_TSV:-resolved_accessions.tsv}"
REF_FASTA="${REF_FASTA:-AN6.fasta}"

ENV_NAME="${ENV_NAME:-/home/jhuang/miniconda3/envs/bengal3_ac3}"

EXTRA_ASSEMBLIES=(
  #"GCF_002291425.1"
  #"GCF_047901425.1"
  #"GCF_004342245.1"
  #"GCA_032062225.1"
)

CLUSTERS_K="${CLUSTERS_K:-6}"

MODE="${1:-all}"

log(){ echo "[$(date +'%F %T')] $*" >&2; }
need_cmd(){ command -v "$1" >/dev/null 2>&1; }

activate_existing_env(){
  if [[ ! -d "${ENV_NAME}" ]]; then
    log "ERROR: ENV_NAME path does not exist: ${ENV_NAME}"
    exit 1
  fi

  conda_base="$(dirname "$(dirname "${ENV_NAME}")")"
  if [[ -f "${conda_base}/etc/profile.d/conda.sh" ]]; then
    # shellcheck disable=SC1091
    source "${conda_base}/etc/profile.d/conda.sh"
  else
    if need_cmd conda; then
      # shellcheck disable=SC1091
      source "$(conda info --base)/etc/profile.d/conda.sh"
    else
      log "ERROR: cannot find conda.sh and conda is not on PATH."
      exit 1
    fi
  fi

  # Avoid "unbound variable" in activation hooks under set -u
  export JAVA_HOME="${JAVA_HOME:-}"

  log "Activating env: ${ENV_NAME}"
  set +u
  conda activate "${ENV_NAME}"
  set -u
}

check_dependencies() {
  # ---- plot-only mode: only need R (and optionally python) ----
  if [[ "${MODE}" == "plot-only" ]]; then
    local missing=()

    command -v Rscript >/dev/null 2>&1 || missing+=("Rscript")
    command -v python  >/dev/null 2>&1 || missing+=("python")

    if (( ${#missing[@]} )); then
      log "ERROR: Missing required tools for plot-only in env: ${ENV_NAME}"
      printf '  - %s\n' "${missing[@]}" >&2
      exit 1
    fi

    # Check required R packages (fail early with clear message)
    Rscript -e 'pkgs <- c("ggtree","ggplot2","aplot");
                miss <- pkgs[!sapply(pkgs, requireNamespace, quietly=TRUE)];
                if(length(miss)) stop("Missing R packages: ", paste(miss, collapse=", "))'

    return 0
  fi
  # ------------------------------------------------------------

  # existing full-pipeline checks continue below...
  # (your current prokka/roary/raxml-ng checks stay as-is)
  #...
}

prepare_accessions(){
  [[ -s "${RESOLVED_TSV}" ]] || { log "ERROR: missing ${RESOLVED_TSV}"; exit 1; }
  mkdir -p "${WORKDIR}/meta"
  printf "%s\n" "${EXTRA_ASSEMBLIES[@]}" > "${WORKDIR}/meta/extras.txt"

  WORKDIR="${WORKDIR}" RESOLVED_TSV="${RESOLVED_TSV}" python - << 'PY'
import os
import pandas as pd
import pathlib

workdir = pathlib.Path(os.environ.get("WORKDIR", "work_wgs_tree"))
resolved_tsv = os.environ.get("RESOLVED_TSV", "resolved_accessions.tsv")

df = pd.read_csv(resolved_tsv, sep="\t")

# Expect columns like: label, best_accession (but be tolerant)
if "best_accession" not in df.columns:
    df = df.rename(columns={df.columns[1]:"best_accession"})
if "label" not in df.columns:
    df = df.rename(columns={df.columns[0]:"label"})

df = df[["label","best_accession"]].dropna()
df = df[df["best_accession"]!="NA"].copy()

extras_path = workdir/"meta/extras.txt"
extras = [x.strip() for x in extras_path.read_text().splitlines() if x.strip()]
extra_df = pd.DataFrame({"label":[f"EXTRA_{a}" for a in extras], "best_accession": extras})

all_df = pd.concat([df, extra_df], ignore_index=True)
all_df = all_df.drop_duplicates(subset=["best_accession"], keep="first").reset_index(drop=True)

out = workdir/"meta/accessions.tsv"
out.parent.mkdir(parents=True, exist_ok=True)
all_df.to_csv(out, sep="\t", index=False)

print("Final unique genomes:", len(all_df))
print(all_df)
print("Wrote:", out)
PY
}

download_genomes(){
  mkdir -p "${WORKDIR}/genomes_ncbi"

  while IFS=$'\t' read -r label acc; do
    [[ "$label" == "label" ]] && continue
    [[ -z "${acc}" ]] && continue

    outdir="${WORKDIR}/genomes_ncbi/${acc}"
    if [[ -d "${outdir}" ]]; then
      log "Found ${acc}, skipping download"
      continue
    fi

    log "Downloading ${acc}..."
    datasets download genome accession "${acc}" --include genome --filename "${WORKDIR}/genomes_ncbi/${acc}.zip"
    unzip -q "${WORKDIR}/genomes_ncbi/${acc}.zip" -d "${outdir}"
    rm -f "${WORKDIR}/genomes_ncbi/${acc}.zip"
  done < "${WORKDIR}/meta/accessions.tsv"
}

collect_fastas(){
  mkdir -p "${WORKDIR}/fastas"

  while IFS=$'\t' read -r label acc; do
    [[ "$label" == "label" ]] && continue
    [[ -z "${acc}" ]] && continue

    fna="$(find "${WORKDIR}/genomes_ncbi/${acc}" -type f -name "*.fna" | head -n 1 || true)"
    [[ -n "${fna}" ]] || { log "ERROR: .fna not found for ${acc}"; exit 1; }
    cp -f "${fna}" "${WORKDIR}/fastas/${acc}.fna"
  done < "${WORKDIR}/meta/accessions.tsv"

  [[ -s "${REF_FASTA}" ]] || { log "ERROR: missing ${REF_FASTA}"; exit 1; }
  cp -f "${REF_FASTA}" "${WORKDIR}/fastas/REF.fna"
}

run_prokka(){
  mkdir -p "${WORKDIR}/prokka" "${WORKDIR}/gffs"

  for fna in "${WORKDIR}/fastas/"*.fna; do
    base="$(basename "${fna}" .fna)"
    outdir="${WORKDIR}/prokka/${base}"
    gffout="${WORKDIR}/gffs/${base}.gff"

    if [[ -s "${gffout}" ]]; then
      log "GFF exists for ${base}, skipping Prokka"
      continue
    fi

    log "Prokka annotating ${base}..."
    prokka --outdir "${outdir}" --prefix "${base}" --cpus "${THREADS}" --force "${fna}"
    cp -f "${outdir}/${base}.gff" "${gffout}"
  done
}

sanitize_and_check_gffs(){
  log "Sanity checking GFFs (ignoring ##FASTA section)..."
  for gff in "${WORKDIR}/gffs/"*.gff; do
    if file "$gff" | grep -qi "CRLF"; then
      log "Fixing CRLF -> LF in $(basename "$gff")"
      sed -i 's/\r$//' "$gff"
    fi

    bad=$(awk '
      BEGIN{bad=0; in_fasta=0}
      /^##FASTA/{in_fasta=1; next}
      in_fasta==1{next}
      /^#/{next}
      NF==0{next}
      {
        if (split($0,a,"\t")!=9) {bad=1}
      }
      END{print bad}
    ' "$gff")

    if [[ "$bad" == "1" ]]; then
      log "ERROR: GFF feature section not 9-column tab-delimited: $gff"
      log "First 5 problematic feature lines (before ##FASTA):"
      awk '
        BEGIN{in_fasta=0; c=0}
        /^##FASTA/{in_fasta=1; next}
        in_fasta==1{next}
        /^#/{next}
        NF==0{next}
        {
          if (split($0,a,"\t")!=9) {
            print
            c++
            if (c==5) exit
          }
        }
      ' "$gff" || true
      exit 1
    fi
  done
}

run_roary(){
  mkdir -p "${WORKDIR}/meta" "${WORKDIR}/logs"

  ts="$(date +%s)"
  run_id="${ts}_$$"
  ROARY_OUT="${WORKDIR}/roary_${run_id}"

  ROARY_STDOUT="${WORKDIR}/logs/roary_${run_id}.stdout.txt"
  ROARY_STDERR="${WORKDIR}/logs/roary_${run_id}.stderr.txt"

  MARKER="${WORKDIR}/meta/roary_${run_id}.start"
  : > "${MARKER}"

  log "Running Roary (outdir: ${ROARY_OUT})"
  log "Roary logs:"
  log "  STDOUT: ${ROARY_STDOUT}"
  log "  STDERR: ${ROARY_STDERR}"

  set +e
  roary -e --mafft -p "${THREADS}" -cd 95 -i 95 \
    -f "${ROARY_OUT}" "${WORKDIR}/gffs/"*.gff \
    > "${ROARY_STDOUT}" 2> "${ROARY_STDERR}"
  rc=$?
  set -e

  if [[ "${rc}" -ne 0 ]]; then
    log "WARNING: Roary exited non-zero (rc=${rc}). Will check if core alignment was produced anyway."
  fi

  CORE_ALN="$(find "${WORKDIR}" -maxdepth 2 -type f -name "core_gene_alignment.aln" -newer "${MARKER}" -printf '%T@ %p\n' 2>/dev/null \
    | sort -nr | head -n 1 | cut -d' ' -f2- || true)"

  if [[ -z "${CORE_ALN}" || ! -s "${CORE_ALN}" ]]; then
    log "ERROR: Could not find core_gene_alignment.aln produced by this Roary run under ${WORKDIR}"
    log "---- STDERR (head) ----"
    head -n 120 "${ROARY_STDERR}" 2>/dev/null || true
    log "---- STDERR (tail) ----"
    tail -n 120 "${ROARY_STDERR}" 2>/dev/null || true
    exit 1
  fi

  CORE_DIR="$(dirname "${CORE_ALN}")"
  cp -f "${ROARY_STDOUT}" "${CORE_DIR}/roary.stdout.txt" || true
  cp -f "${ROARY_STDERR}" "${CORE_DIR}/roary.stderr.txt" || true

  # >>> IMPORTANT FIX: store ABSOLUTE path <<<
  CORE_ALN_ABS="$(readlink -f "${CORE_ALN}")"
  log "Using core alignment: ${CORE_ALN_ABS}"

  echo "${CORE_ALN_ABS}" > "${WORKDIR}/meta/core_alignment_path.txt"
  echo "$(readlink -f "${CORE_DIR}")" > "${WORKDIR}/meta/roary_output_dir.txt"
}

run_raxmlng(){
  mkdir -p "${WORKDIR}/raxmlng"

  CORE_ALN="$(cat "${WORKDIR}/meta/core_alignment_path.txt")"
  [[ -s "${CORE_ALN}" ]] || { log "ERROR: core alignment not found or empty: ${CORE_ALN}"; exit 1; }

  log "Running RAxML-NG..."
  raxml-ng --all \
    --msa "${CORE_ALN}" \
    --model GTR+G \
    --bs-trees 1000 \
    --threads "${THREADS}" \
    --prefix "${WORKDIR}/raxmlng/core"
}

ensure_r_pkgs(){
  Rscript - <<'RS'
need <- c("ape","ggplot2","dplyr","readr","aplot","ggtree")
missing <- need[!vapply(need, requireNamespace, logical(1), quietly=TRUE)]
if (length(missing)) {
  message("Missing R packages: ", paste(missing, collapse=", "))
  message("Try:")
  message("  conda install -c conda-forge -c bioconda r-aplot bioconductor-ggtree r-ape r-ggplot2 r-dplyr r-readr")
  quit(status=1)
}
RS
}

plot_tree(){
  mkdir -p "${WORKDIR}/plot"

  WORKDIR="${WORKDIR}" python - << 'PY'
import os
import pandas as pd
import pathlib

workdir = pathlib.Path(os.environ.get("WORKDIR", "work_wgs_tree"))

acc = pd.read_csv(workdir/"meta/accessions.tsv", sep="\t")
g = (acc.groupby("best_accession")["label"]
       .apply(lambda x: "; ".join(sorted(set(map(str, x)))))
       .reset_index())
g["display"] = g.apply(lambda r: f'{r["label"]} ({r["best_accession"]})', axis=1)
labels = g.rename(columns={"best_accession":"sample"})[["sample","display"]]

# Add REF
labels = pd.concat([labels, pd.DataFrame([{"sample":"REF","display":"REF"}])], ignore_index=True)

out = workdir/"plot/labels.tsv"
out.parent.mkdir(parents=True, exist_ok=True)
labels.to_csv(out, sep="\t", index=False)
print("Wrote:", out)
PY

  cat > "${WORKDIR}/plot/plot_tree.R" << 'RS'
suppressPackageStartupMessages({
  library(ape); library(ggplot2); library(ggtree); library(dplyr); library(readr)
})
args <- commandArgs(trailingOnly=TRUE)
tree_in <- args[1]; labels_tsv <- args[2]; k <- as.integer(args[3])
out_pdf <- args[4]; out_png <- args[5]

tr <- read.tree(tree_in)
lab <- read_tsv(labels_tsv, show_col_types=FALSE)
tipmap <- setNames(lab$display, lab$sample)
tr$tip.label <- ifelse(tr$tip.label %in% names(tipmap), tipmap[tr$tip.label], tr$tip.label)

hc <- as.hclust.phylo(tr)
grp <- cutree(hc, k=k)
grp_df <- tibble(tip=names(grp), clade=paste0("Clade_", grp))

p <- ggtree(tr, layout="rectangular") %<+% grp_df +
  aes(color=clade) +
  geom_tree(linewidth=0.9) +
  geom_tippoint(aes(color=clade), size=2.3) +
  geom_tiplab(aes(color=clade), size=3.1, align=TRUE,
              linetype="dotted", linesize=0.35, offset=0.02) +
  theme_tree2() +
  theme(legend.position="right", legend.title=element_blank(),
        plot.margin=margin(8,18,8,8))
  #      + geom_treescale(x=0, y=0, width=0.01, fontsize=3)
# ---- Manual scale bar (fixed label "0.01") ----
scale_x <- 0
scale_y <- 0
scale_w <- 0.01

p <- p +
  annotate("segment",
           x = scale_x, xend = scale_x + scale_w,
           y = scale_y, yend = scale_y,
           linewidth = 0.6) +
  annotate("text",
           x = scale_x + scale_w/2,
           y = scale_y - 0.6,
           label = "0.01",
           size = 3)
# ----------------------------------------------

ggsave(out_pdf, p, width=9, height=6.5, device="pdf")
ggsave(out_png, p, width=9, height=6.5, dpi=300)
RS

  Rscript "${WORKDIR}/plot/plot_tree.R" \
    "${WORKDIR}/raxmlng/core.raxml.support" \
    "${WORKDIR}/plot/labels.tsv" \
    "${CLUSTERS_K}" \
    "${WORKDIR}/plot/core_tree.pdf" \
    "${WORKDIR}/plot/core_tree.png"

  log "Plot written:"
  log "  ${WORKDIR}/plot/core_tree.pdf"
  log "  ${WORKDIR}/plot/core_tree.png"
}

main(){
  mkdir -p "${WORKDIR}"

  activate_existing_env
  check_dependencies

  if [[ "${MODE}" == "plot-only" ]]; then
    log "Running plot-only mode"
    plot_tree
    log "DONE."
    exit 0
  fi

  log "1) Prepare unique accessions"
  prepare_accessions

  log "2) Download genomes"
  download_genomes

  log "3) Collect FASTAs (+ REF)"
  collect_fastas

  log "4) Prokka"
  run_prokka

  log "4b) GFF sanity check"
  sanitize_and_check_gffs

  log "5) Roary"
  run_roary

  log "6) RAxML-NG"
  run_raxmlng

  #log "6b) Check R packages"
  #ensure_r_pkgs

  log "7) Plot"
  plot_tree

  log "DONE."
}

main "$@"

File: regenerate_labels.sh

python - <<'PY'
import json, re
from pathlib import Path
import pandas as pd

WORKDIR = Path("work_wgs_tree")
ACC_TSV = WORKDIR / "meta/accessions.tsv"
GENOMES_DIR = WORKDIR / "genomes_ncbi"
OUT = WORKDIR / "plot/labels.tsv"

def first_existing(paths):
    for p in paths:
        if p and Path(p).exists():
            return Path(p)
    return None

def find_metadata_files(acc_dir: Path):
    # NCBI Datasets layouts vary by version; search broadly
    candidates = []
    for pat in [
        "**/assembly_data_report.jsonl",
        "**/data_report.jsonl",
        "**/dataset_catalog.json",
        "**/*assembly_report*.txt",
        "**/*assembly_report*.tsv",
    ]:
        candidates += list(acc_dir.glob(pat))
    # de-dup, keep stable order
    seen = set()
    uniq = []
    for p in candidates:
        if p.as_posix() not in seen:
            uniq.append(p)
            seen.add(p.as_posix())
    return uniq

def parse_jsonl_for_name_and_strain(p: Path):
    # assembly_data_report.jsonl / data_report.jsonl: first JSON object usually has organism info
    try:
        with p.open() as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                obj = json.loads(line)
                # Try common fields
                # organismName may appear as:
                # obj["organism"]["organismName"] or obj["organismName"]
                org = None
                strain = None

                if isinstance(obj, dict):
                    if "organism" in obj and isinstance(obj["organism"], dict):
                        org = obj["organism"].get("organismName") or obj["organism"].get("taxName")
                        # isolate/strain can hide in infraspecificNames or isolate/strain keys
                        infra = obj["organism"].get("infraspecificNames") or {}
                        if isinstance(infra, dict):
                            strain = infra.get("strain") or infra.get("isolate")
                        strain = strain or obj["organism"].get("strain") or obj["organism"].get("isolate")

                    org = org or obj.get("organismName") or obj.get("taxName")

                    # Sometimes isolate/strain is nested elsewhere
                    if not strain:
                        # assemblyInfo / assembly / sampleInfo patterns
                        for key in ["assemblyInfo", "assembly", "sampleInfo", "biosample"]:
                            if key in obj and isinstance(obj[key], dict):
                                d = obj[key]
                                strain = strain or d.get("strain") or d.get("isolate")
                                infra = d.get("infraspecificNames")
                                if isinstance(infra, dict):
                                    strain = strain or infra.get("strain") or infra.get("isolate")

                if org:
                    return org, strain
    except Exception:
        pass
    return None, None

def parse_dataset_catalog(p: Path):
    # dataset_catalog.json can include assembly/organism info, but structure varies.
    try:
        obj = json.loads(p.read_text())
    except Exception:
        return None, None

    org = None
    strain = None

    # walk dict recursively looking for likely keys
    def walk(x):
        nonlocal org, strain
        if isinstance(x, dict):
            # organism keys
            if not org:
                if "organismName" in x and isinstance(x["organismName"], str):
                    org = x["organismName"]
                elif "taxName" in x and isinstance(x["taxName"], str):
                    org = x["taxName"]
            # strain/isolate keys
            if not strain:
                for k in ["strain", "isolate"]:
                    if k in x and isinstance(x[k], str) and x[k].strip():
                        strain = x[k].strip()
                        break
            for v in x.values():
                walk(v)
        elif isinstance(x, list):
            for v in x:
                walk(v)

    walk(obj)
    return org, strain

def parse_assembly_report_txt(p: Path):
    # NCBI assembly_report.txt often has lines like: "# Organism name:" and "# Infraspecific name:"
    org = None
    strain = None
    try:
        for line in p.read_text(errors="ignore").splitlines():
            if line.startswith("# Organism name:"):
                org = line.split(":", 1)[1].strip()
            elif line.startswith("# Infraspecific name:"):
                val = line.split(":", 1)[1].strip()
                # e.g. "strain=XXXX" or "isolate=YYYY"
                m = re.search(r"(strain|isolate)\s*=\s*(.+)", val)
                if m:
                    strain = m.group(2).strip()
            if org and strain:
                break
    except Exception:
        pass
    return org, strain

def best_name_for_accession(acc: str):
    acc_dir = GENOMES_DIR / acc
    if not acc_dir.exists():
        return None

    files = find_metadata_files(acc_dir)

    org = None
    strain = None

    # Prefer JSONL reports first
    for p in files:
        if p.name.endswith(".jsonl"):
            org, strain = parse_jsonl_for_name_and_strain(p)
            if org:
                break

    # Next try dataset_catalog.json
    if not org:
        for p in files:
            if p.name == "dataset_catalog.json":
                org, strain = parse_dataset_catalog(p)
                if org:
                    break

    # Finally try assembly report text
    if not org:
        for p in files:
            if "assembly_report" in p.name and p.suffix in [".txt", ".tsv"]:
                org, strain = parse_assembly_report_txt(p)
                if org:
                    break

    if not org:
        return None

    # normalize whitespace
    org = re.sub(r"\s+", " ", org).strip()
    if strain:
        strain = re.sub(r"\s+", " ", str(strain)).strip()
        # avoid duplicating if strain already in organism string
        if strain and strain.lower() not in org.lower():
            return f"{org} {strain}"
    return org

# --- build labels ---
acc = pd.read_csv(ACC_TSV, sep="\t")
if "label" not in acc.columns or "best_accession" not in acc.columns:
    raise SystemExit("accessions.tsv must have columns: label, best_accession")

rows = []

for _, r in acc.iterrows():
    label = str(r["label"])
    accn  = str(r["best_accession"])

    if label.startswith("EXTRA_"):
        nm = best_name_for_accession(accn)
        if nm:
            label = nm
        else:
            # fallback: keep previous behavior if metadata not found
            label = label.replace("EXTRA_", "EXTRA ")

    display = f"{label} ({accn})"
    rows.append({"sample": accn, "display": display})

# Add GE11174 exactly as-is
rows.append({"sample": "GE11174", "display": "GE11174"})

out_df = pd.DataFrame(rows).drop_duplicates(subset=["sample"], keep="first")
OUT.parent.mkdir(parents=True, exist_ok=True)
out_df.to_csv(OUT, sep="\t", index=False)

print("Wrote:", OUT)
print(out_df)
PY

File: plot_tree_v4.R

suppressPackageStartupMessages({
  library(ape)
  library(readr)
})

args <- commandArgs(trailingOnly = TRUE)
tree_in    <- args[1]
labels_tsv <- args[2]
# args[3] is k (ignored here since all-black)
out_pdf    <- args[4]
out_png    <- args[5]

# --- Load tree ---
tr <- read.tree(tree_in)

# --- Root on outgroup (Brenneria nigrifluens) by accession ---
outgroup_id <- "GCF_005484965.1"
if (outgroup_id %in% tr$tip.label) {
  tr <- root(tr, outgroup = outgroup_id, resolve.root = TRUE)
} else {
  warning("Outgroup tip not found in tree: ", outgroup_id, " (tree will remain unrooted)")
}

# Make plotting order nicer
tr <- ladderize(tr, right = FALSE)

# --- Load labels (columns: sample, display) ---
lab <- read_tsv(labels_tsv, show_col_types = FALSE)
if (!all(c("sample","display") %in% colnames(lab))) {
  stop("labels.tsv must contain columns: sample, display")
}

# Map tip labels AFTER rooting (rooting uses accession IDs)
tipmap <- setNames(lab$display, lab$sample)
tr$tip.label <- ifelse(tr$tip.label %in% names(tipmap),
                       unname(tipmap[tr$tip.label]),
                       tr$tip.label)

# --- Plot helper ---
plot_one <- function(device_fun) {
  device_fun()

  op <- par(no.readonly = TRUE)
  on.exit(par(op), add = TRUE)

  # Bigger right margin for long labels; tighter overall
  par(mar = c(4, 2, 2, 18), xpd = NA)

  # Compute xlim with padding so labels fit but whitespace is limited
  xx <- node.depth.edgelength(tr)
  xmax <- max(xx)
  xpad <- 0.10 * xmax

  plot(tr,
       type = "phylogram",
       use.edge.length = TRUE,
       show.tip.label = TRUE,
       edge.color = "black",
       tip.color  = "black",
       cex = 0.9,            # smaller text -> less overlap
       label.offset = 0.003,  # small gap after tip
       no.margin = FALSE,
       x.lim = c(0, xmax + xpad))

  # Add a clear scale bar near bottom-left.
  # NOTE: the bar length is fixed at 0.01 substitutions/site (matching the ggtree
  # version above); to scale it with tree depth instead, use e.g. length = 0.05 * xmax.
  add.scale.bar(x = 0, y = 0, length = 0.01, lwd = 2, cex = 0.9)
}

# --- Write outputs (shorter height -> less vertical whitespace) ---
plot_one(function() pdf(out_pdf, width = 11, height = 6, useDingbats = FALSE))
dev.off()

plot_one(function() png(out_png, width = 3000, height = 1000, res = 300))
dev.off()

cat("Wrote:\n", out_pdf, "\n", out_png, "\n", sep = "")

File: run_fastani_batch_verbose.sh

#!/usr/bin/env bash
set -euo pipefail

# ============ CONFIG ============
QUERY="bacass_out/Prokka/An6/An6.fna"   # your query FASTA
ACC_LIST="accessions.txt"               # one GCF/GCA accession per line
OUTDIR="fastani_batch"
THREADS=8
SUFFIX=".genomic.fna"
# =================================

ts() { date +"%F %T"; }
log() { echo "[$(ts)] $*"; }
die() { echo "[$(ts)] ERROR: $*" >&2; exit 1; }

# --- checks ---
log "Checking required commands..."
for cmd in fastANI awk sort unzip find grep wc head readlink; do
  command -v "$cmd" >/dev/null 2>&1 || die "Missing command: $cmd"
done

command -v datasets >/dev/null 2>&1 || die "Missing NCBI datasets CLI. Install from NCBI Datasets."

[[ -f "$QUERY" ]] || die "QUERY not found: $QUERY"
[[ -f "$ACC_LIST" ]] || die "Accession list not found: $ACC_LIST"

log "QUERY: $QUERY"
log "ACC_LIST: $ACC_LIST"
log "OUTDIR: $OUTDIR"
log "THREADS: $THREADS"

mkdir -p "$OUTDIR/ref_fasta" "$OUTDIR/zips" "$OUTDIR/tmp" "$OUTDIR/logs"

REF_LIST="$OUTDIR/ref_list.txt"
QUERY_LIST="$OUTDIR/query_list.txt"
RAW_OUT="$OUTDIR/fastani_raw.tsv"
FINAL_OUT="$OUTDIR/fastani_results.tsv"
DL_LOG="$OUTDIR/logs/download.log"
ANI_LOG="$OUTDIR/logs/fastani.log"

: > "$REF_LIST"
: > "$DL_LOG"
: > "$ANI_LOG"

# --- build query list ---
q_abs="$(readlink -f "$QUERY")"
echo "$q_abs" > "$QUERY_LIST"
log "Wrote query list: $QUERY_LIST"
log "  -> $q_abs"

# --- download refs ---
log "Downloading reference genomes via NCBI datasets..."
n_ok=0
n_skip=0
while read -r acc; do
  [[ -z "$acc" ]] && continue
  [[ "$acc" =~ ^# ]] && continue

  log "Ref: $acc"
  zip="$OUTDIR/zips/${acc}.zip"
  unpack="$OUTDIR/tmp/$acc"
  out_fna="$OUTDIR/ref_fasta/${acc}${SUFFIX}"

  # download zip
  log "  - datasets download -> $zip"
  if datasets download genome accession "$acc" --include genome --filename "$zip" >>"$DL_LOG" 2>&1; then
    log "  - download OK"
  else
    log "  - download FAILED (see $DL_LOG), skipping $acc"
    n_skip=$((n_skip+1))
    continue
  fi

  # unzip
  rm -rf "$unpack"
  mkdir -p "$unpack"
  log "  - unzip -> $unpack"
  if unzip -q "$zip" -d "$unpack" >>"$DL_LOG" 2>&1; then
    log "  - unzip OK"
  else
    log "  - unzip FAILED (see $DL_LOG), skipping $acc"
    n_skip=$((n_skip+1))
    continue
  fi

  # find genomic.fna (handle varying package layouts: prefer genomic.fna, then fall back to any .fna)
  fna="$(find "$unpack" -type f \( -name "*genomic.fna" -o -name "*genomic.fna.gz" \) | head -n 1 || true)"
  if [[ -z "${fna:-}" ]]; then
    log "  - genomic.fna not found, try any *.fna"
    fna="$(find "$unpack" -type f -name "*.fna" | head -n 1 || true)"
  fi

  if [[ -z "${fna:-}" ]]; then
    log "  - FAILED to find any .fna in package (see $DL_LOG). skipping $acc"
    n_skip=$((n_skip+1))
    continue
  fi

  # handle gz if needed
  if [[ "$fna" == *.gz ]]; then
    log "  - found gzipped fasta: $(basename "$fna"), gunzip -> $out_fna"
    gunzip -c "$fna" > "$out_fna"
  else
    log "  - found fasta: $(basename "$fna"), copy -> $out_fna"
    cp -f "$fna" "$out_fna"
  fi

  # sanity check fasta looks non-empty
  if [[ ! -s "$out_fna" ]]; then
    log "  - output fasta is empty, skipping $acc"
    n_skip=$((n_skip+1))
    continue
  fi

  echo "$(readlink -f "$out_fna")" >> "$REF_LIST"
  n_ok=$((n_ok+1))
  log "  - saved ref fasta OK"
done < "$ACC_LIST"

log "Download summary: OK=$n_ok, skipped=$n_skip"
log "Ref list written: $REF_LIST ($(wc -l < "$REF_LIST") refs)"
if [[ "$(wc -l < "$REF_LIST")" -eq 0 ]]; then
  die "No references available. Check $DL_LOG"
fi

# --- run fastANI ---
log "Running fastANI..."
log "Command:"
log "  fastANI -ql $QUERY_LIST -rl $REF_LIST -t $THREADS -o $RAW_OUT"

# Important: do not swallow error messages; capture stdout/stderr in the log
if fastANI -ql "$QUERY_LIST" -rl "$REF_LIST" -t "$THREADS" -o "$RAW_OUT" >>"$ANI_LOG" 2>&1; then
  log "fastANI finished (see $ANI_LOG)"
else
  log "fastANI FAILED (see $ANI_LOG)"
  die "fastANI failed. Inspect $ANI_LOG"
fi

# --- verify raw output ---
if [[ ! -f "$RAW_OUT" ]]; then
  die "fastANI did not create $RAW_OUT. Check $ANI_LOG"
fi
if [[ ! -s "$RAW_OUT" ]]; then
  die "fastANI output is empty ($RAW_OUT). Check $ANI_LOG; also verify fasta validity."
fi

log "fastANI raw output: $RAW_OUT ($(wc -l < "$RAW_OUT") lines)"
log "Sample lines:"
head -n 5 "$RAW_OUT" || true

# --- create final table ---
log "Creating final TSV with header..."
echo -e "Query\tReference\tANI\tMatchedFrag\tTotalFrag" > "$FINAL_OUT"
awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4,$5}' "$RAW_OUT" >> "$FINAL_OUT"

log "Final results: $FINAL_OUT ($(wc -l < "$FINAL_OUT") lines incl. header)"
log "Top hits (ANI desc):"
tail -n +2 "$FINAL_OUT" | sort -k3,3nr | head -n 10 || true

log "DONE."
log "Logs:"
log "  download log: $DL_LOG"
log "  fastANI log:  $ANI_LOG"

Processing Data_Benjamin_DNAseq_2026_GE11174 — Bacterial WGS pipeline (standalone post)

This post documents the full end-to-end workflow used to process GE11174 from Data_Benjamin_DNAseq_2026_GE11174, covering: database setup, assembly + QC, species identification (k-mers + ANI), genome annotation, generation of summary tables, AMR/resistome/virulence profiling, phylogenetic tree construction (with reproducible plotting), optional debugging/reruns, optional closest-isolate comparison, and a robust approach to find nearest genomes from GenBank.


Overview of the workflow

High-level stages

  1. Prepare databases (KmerFinder DB; Kraken2 DB used by bacass).
  2. Assemble + QC + taxonomic context using nf-core/bacass (Nextflow + Docker).
  3. Interpret KmerFinder results (species hint; confirm with ANI for publication).
  4. Annotate the genome using BV-BRC ComprehensiveGenomeAnalysis.
  5. Generate Table 1 (sequence + assembly + genome features) under gunc_env and export to Excel.
  6. AMR / resistome / virulence profiling using ABRicate (+ MEGARes/CARD/ResFinder/VFDB) and RGI (CARD models), export to Excel.
  7. Build phylogenetic tree (NCBI retrieval + Roary + RAxML-NG + R plotting).
  8. Debug/re-run guidance (drop one genome, rerun Roary→RAxML, regenerate plot).
  9. ANI + BUSCO interpretation (species boundary explanation and QC interpretation).
  10. fastANI interpretation text (tree + ANI combined narrative).
  11. Optional: closest isolate alignment using nf-core/pairgenomealign.
  12. Optional: NCBI submission (batch submission plan).
  13. Robust closest-genome search from GenBank using NCBI datasets + Mash, with duplicate handling (GCA vs GCF).

0) Inputs / assumptions

  • Sample: GE11174
  • Inputs referenced in commands:

    • samplesheet.tsv for bacass
    • targets.tsv for reference selection (tree step)
    • samplesheet.csv for pairgenomealign (closest isolate comparison)
    • raw reads: GE11174.rawreads.fastq.gz
    • assembly FASTA used downstream: GE11174.fasta (and in some places scaffold outputs like scaffolds.fasta / GE11174.scaffolds.fa)
  • Local reference paths (examples used):

    • Kraken2 DB tarball: /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz
    • KmerFinder DB: /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ (or tarball variants shown below)

1) Database setup — KmerFinder DB

Option A (CGE service):

Option B (Zenodo snapshot):

  • Download 20190108_kmerfinder_stable_dirs.tar.gz from: https://zenodo.org/records/13447056
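A download/unpack sketch for Option B (the file URL follows Zenodo's records/<id>/files/<name> pattern and is an assumption; verify it on the record page):

wget "https://zenodo.org/records/13447056/files/20190108_kmerfinder_stable_dirs.tar.gz"   # URL pattern assumed
tar -xzf 20190108_kmerfinder_stable_dirs.tar.gz -C /mnt/nvme1n1p1/REFs/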

2) Assembly + QC + taxonomic screening — nf-core/bacass

First, optional standalone Unicycler assemblies of the ONT reads for comparison:

    conda activate /home/jhuang/miniconda3/envs/trycycler
    unicycler -l Gibsiella_species_ONT/GE11174.rawreads.fastq.gz --mode normal -t 80 -o GE11174_unicycler_normal            # modes: conservative, normal, bold
    unicycler -l Gibsiella_species_ONT/GE11174.rawreads.fastq.gz --mode conservative -t 80 -o GE11174_unicycler_conservative
    conda deactivate

Then run bacass with Docker and resume support:
# Example --kmerfinderdb values tried/recorded:
# --kmerfinderdb /path/to/kmerfinder/bacteria.tar.gz
# --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder_db.tar.gz
# --kmerfinderdb /mnt/nvme1n1p1/REFs/20190108_kmerfinder_stable_dirs.tar.gz
nextflow run nf-core/bacass -r 2.5.0 -profile docker \
  --input samplesheet.tsv \
  --outdir bacass_out \
  --assembly_type long \
  --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
  --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
  -resume
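For reference, a minimal long-read-only samplesheet.tsv sketch (tab-separated; column layout as documented for nf-core/bacass, so verify against release 2.5.0; the read path is the one used above):

ID	R1	R2	LongFastQ	Fast5	GenomeSize
GE11174	NA	NA	Gibsiella_species_ONT/GE11174.rawreads.fastq.gz	NA	NA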

Outputs used later

  • Scaffolded/assembled FASTA from bacass (e.g., for annotation, AMR screening, Mash sketching, tree building).

3) KmerFinder summary — species hint (with publication note)

Interpretation recorded:

From the KmerFinder summary, the top hit is Gibbsiella quercinecans (strain FRB97; NZ_CP014136.1), with a much higher score and coverage than the second hit (which is low coverage). So it is fair to write: “KmerFinder indicates the isolate is most consistent with Gibbsiella quercinecans.” However, for a species call (especially for publication), confirm with ANI (or a genome taxonomy tool), because k-mer hits alone aren’t always definitive.


4) Genome annotation — BV-BRC ComprehensiveGenomeAnalysis

Annotate the genome using BV-BRC:


5) Table 1 — Summary of sequence data and genome features (env: gunc_env)

Prepare environment and run the Table 1 pipeline:

# activate the env that has openpyxl
mamba activate gunc_env
mamba install -n gunc_env -c conda-forge openpyxl -y
mamba deactivate

# STEP_1
ENV_NAME=gunc_env AUTO_INSTALL=1 THREADS=32 ~/Scripts/make_table1_GE11174.sh

# STEP_2
python export_table1_stats_to_excel_py36_compat.py \
  --workdir table1_GE11174_work \
  --out Comprehensive_GE11174.xlsx \
  --max-rows 200000 \
  --sample GE11174

# STEP_1+2 (combined)
ENV_NAME=gunc_env AUTO_INSTALL=1 THREADS=32 ~/Scripts/make_table1_with_excel.sh

For the items “Total number of reads sequenced” and “Mean read length (bp)”:

pigz -dc GE11174.rawreads.fastq.gz | awk 'END{print NR/4}'
seqkit stats GE11174.rawreads.fastq.gz
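Both values can also be computed in a single pass with awk (equivalent to the two commands above; counts sequence lines and sums their lengths):

pigz -dc GE11174.rawreads.fastq.gz \
  | awk 'NR%4==2 {n++; s+=length($0)} END{printf "reads=%d\tmean_len=%.1f\n", n, s/n}'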

6) AMR gene profiling + Resistome + Virulence profiling (ABRicate + RGI)

This stage produces resistome/virulence tables and an Excel export.

6.1 Databases / context notes

“Table 4. Specialty Genes” note recorded:

  • NDARO: 1
  • Antibiotic Resistance — CARD: 15
  • Antibiotic Resistance — PATRIC: 55
  • Drug Target — TTD: 38
  • Metal Resistance — BacMet: 29
  • Transporter — TCDB: 250
  • Virulence factor — VFDB: 33

Useful sites:

6.2 ABRicate environment + DB listing

conda activate /home/jhuang/miniconda3/envs/bengal3_ac3

abricate --list
#DATABASE        SEQUENCES       DBTYPE  DATE
#vfdb    2597    nucl    2025-Oct-22
#resfinder       3077    nucl    2025-Oct-22
#argannot        2223    nucl    2025-Oct-22
#ecoh    597     nucl    2025-Oct-22
#megares 6635    nucl    2025-Oct-22
#card    2631    nucl    2025-Oct-22
#ecoli_vf        2701    nucl    2025-Oct-22
#plasmidfinder   460     nucl    2025-Oct-22
#ncbi    5386    nucl    2025-Oct-22

abricate-get_db  --list
#Choices: argannot bacmet2 card ecoh ecoli_vf megares ncbi plasmidfinder resfinder vfdb victors (default '').
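For a quick one-off screen outside the wrapper scripts, a direct ABRicate call looks like this (thresholds mirror the MINID/MINCOV values used in 6.5/6.7):

abricate --db card --minid 80 --mincov 60 --threads 32 GE11174.fasta > GE11174.card.tab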

6.3 Install/update DBs (CARD, MEGARes)

# CARD
abricate-get_db --db card

# MEGARes (automatic install; if this errors, use the manual install below)
abricate-get_db --db megares

6.4 Manual MEGARes v3.0 install (if needed)

wget -O megares_database_v3.00.fasta \
  "https://www.meglab.org/downloads/megares_v3.00/megares_database_v3.00.fasta"

# 1) Define dbdir (adjust to your env; from logs it's inside the conda env)
DBDIR=/home/jhuang/miniconda3/envs/bengal3_ac3/db

# 2) Create a custom db folder for MEGARes v3.0
mkdir -p ${DBDIR}/megares_v3.0

# 3) Copy the downloaded MEGARes v3.0 nucleotide FASTA to 'sequences'
cp megares_database_v3.00.fasta ${DBDIR}/megares_v3.0/sequences

# 4) Build ABRicate indices
abricate --setupdb

# Confirm presence
abricate --list | egrep 'card|megares'
abricate --list | grep -i megares

6.5 Run resistome/virulome pipeline scripts

chmod +x run_resistome_virulome_dedup.sh
ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=GE11174.fasta SAMPLE=GE11174 THREADS=32 ./run_resistome_virulome_dedup.sh
ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=./vrap_HF/spades/scaffolds.fasta SAMPLE=HF THREADS=32 ~/Scripts/run_resistome_virulome_dedup.sh
ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=GE11174.fasta SAMPLE=GE11174 MINID=80 MINCOV=60 ./run_resistome_virulome_dedup.sh

6.6 Sanity checks on ABRicate outputs

grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.megares.tab
grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.card.tab
grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.resfinder.tab
grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.vfdb.tab

grep -v '^#' resistome_virulence_GE11174/raw/GE11174.megares.tab | grep -v '^[[:space:]]*$' | head -n 3
grep -v '^#' resistome_virulence_GE11174/raw/GE11174.card.tab | grep -v '^[[:space:]]*$' | head -n 3
grep -v '^#' resistome_virulence_GE11174/raw/GE11174.resfinder.tab | grep -v '^[[:space:]]*$' | head -n 3
grep -v '^#' resistome_virulence_GE11174/raw/GE11174.vfdb.tab | grep -v '^[[:space:]]*$' | head -n 3
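The same counts and previews can be generated with a small loop over the four databases:

for db in megares card resfinder vfdb; do
  f="resistome_virulence_GE11174/raw/GE11174.${db}.tab"
  printf '%-10s %s hits\n' "$db" "$(grep -vc '^#' "$f")"
done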

6.7 Dedup tables / “one per gene” mode

chmod +x run_abricate_resistome_virulome_one_per_gene.sh
ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 \
ASM=GE11174.fasta \
SAMPLE=GE11174 \
OUTDIR=resistome_virulence_GE11174 \
MINID=80 MINCOV=60 \
THREADS=32 \
./run_abricate_resistome_virulome_one_per_gene.sh

Threshold summary recorded:

  • ABRicate thresholds: MINID=70 MINCOV=50

    • MEGARes: 35 → resistome_virulence_GE11174/raw/GE11174.megares.tab
    • CARD: 28 → resistome_virulence_GE11174/raw/GE11174.card.tab
    • ResFinder: 2 → resistome_virulence_GE11174/raw/GE11174.resfinder.tab
    • VFDB: 18 → resistome_virulence_GE11174/raw/GE11174.vfdb.tab
  • ABRicate thresholds: MINID=80 MINCOV=60

    • MEGARes: 3
    • CARD: 1
    • ResFinder: 0
    • VFDB: 0

6.8 Merge sources + export to Excel

python merge_amr_sources_by_gene.py

python export_resistome_virulence_to_excel_py36.py \
  --workdir resistome_virulence_GE11174 \
  --sample GE11174 \
  --out Resistome_Virulence_GE11174.xlsx

6.9 Methods sentence + table captions (recorded text)

Methods sentence (AMR + virulence)

AMR genes were identified by screening the genome assembly with ABRicate against the MEGARes and ResFinder databases, using minimum identity and coverage thresholds of X% and Y%, respectively. CARD-based AMR determinants were additionally predicted using RGI (Resistance Gene Identifier) to leverage curated resistance models. Virulence factors were screened using ABRicate against VFDB under the same thresholds. Replace X/Y with your actual values (e.g., 90/60) or state “default parameters” if you truly used defaults.

Table 2 caption (AMR)

Table 2. AMR gene profiling of the genome assembly. Hits were detected using ABRicate (MEGARes and ResFinder) and RGI (CARD). The presence of AMR-associated genes does not necessarily imply phenotypic resistance, which may depend on allele type, genomic context/expression, and/or SNP-mediated mechanisms; accordingly, phenotype predictions (e.g., ResFinder) should be interpreted cautiously.

Table 3 caption (virulence)

Table 3. Virulence factor profiling of the genome assembly based on ABRicate screening against VFDB, reporting loci with sequence identity and coverage above the specified thresholds.


7) Phylogenetic tree generation (Nextflow/NCBI + Roary + RAxML-NG + R plotting)

7.1 Resolve/choose assemblies via Entrez

export NCBI_EMAIL="x.yyy@zzz.de"
./resolve_best_assemblies_entrez.py targets.tsv resolved_accessions.tsv
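For orientation, targets.tsv follows the two-column layout documented in resolve_best_assemblies_entrez.py (label <tab> query); the rows below are illustrative placeholders, not the actual file used:

label	query
Gibbsiella_quercinecans_FRB97	Gibbsiella quercinecans FRB97
Brenneria_nigrifluens_LMG5956	Brenneria nigrifluens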

7.2 Build tree (main pipeline) + note about R env

Recorded note:

NOTE: the env bengal3_ac3 does not have the R packages required for plotting; use r_env for the plot step → run the script twice: first the full pipeline under bengal3_ac3, then build_wgs_tree_fig3B.sh plot-only under r_env.

Suggested package install (if needed):

#mamba install -y -c conda-forge -c bioconda r-aplot bioconductor-ggtree r-ape r-ggplot2 r-dplyr r-readr

Run:

export ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3
export NCBI_EMAIL="x.yyy@zzz.de"
./build_wgs_tree_fig3B.sh

# Regenerate the plot
conda activate r_env
./build_wgs_tree_fig3B.sh plot-only

7.3 Manual label corrections

Edit:

  • vim work_wgs_tree/plot/labels.tsv

Recorded edits:

  • REMOVE:

    • GCA_032062225.1 EXTRA_GCA_032062225.1 (GCA_032062225.1)
    • GCF_047901425.1 EXTRA_GCF_047901425.1 (GCF_047901425.1)
  • ADAPT:

    • Gibbsiella quercinecans DSM 25889 (GCF_004342245.1)
    • Gibbsiella greigii USA56
    • Gibbsiella papilionis PWX6
    • Gibbsiella quercinecans strain FRB97
    • Brenneria nigrifluens LMG 5956

7.4 Plot with plot_tree_v4.R

Rscript work_wgs_tree/plot/plot_tree_v4.R \
  work_wgs_tree/raxmlng/core.raxml.support \
  work_wgs_tree/plot/labels.tsv \
  6 \
  work_wgs_tree/plot/core_tree.pdf \
  work_wgs_tree/plot/core_tree.png

8) DEBUG rerun recipe (drop one genome; rerun Roary → RAxML-NG → plot)

Example: drop GCF_047901425.1 (or the other listed one).

8.1 Remove from inputs

# 1.1) remove from inputs so Roary cannot include it
rm -f work_wgs_tree/gffs/GCF_047901425.1.gff
rm -f work_wgs_tree/fastas/GCF_047901425.1.fna
rm -rf work_wgs_tree/prokka/GCF_047901425.1
rm -rf work_wgs_tree/genomes_ncbi/GCF_047901425.1  #optional

# 1.2) remove from accession list so it won't come back
awk -F'\t' 'NR==1 || $2!="GCF_047901425.1"' work_wgs_tree/meta/accessions.tsv > work_wgs_tree/meta/accessions.tsv.tmp \
  && mv work_wgs_tree/meta/accessions.tsv.tmp work_wgs_tree/meta/accessions.tsv

Alternative removal target:

# 2.1) remove from inputs so Roary cannot include it
rm -f work_wgs_tree/gffs/GCA_032062225.1.gff
rm -f work_wgs_tree/fastas/GCA_032062225.1.fna
rm -rf work_wgs_tree/prokka/GCA_032062225.1
rm -rf work_wgs_tree/genomes_ncbi/GCA_032062225.1  #optional

# 2.2) remove from accession list so it won't come back
awk -F'\t' 'NR==1 || $2!="GCA_032062225.1"' work_wgs_tree/meta/accessions.tsv > work_wgs_tree/meta/accessions.tsv.tmp \
  && mv work_wgs_tree/meta/accessions.tsv.tmp work_wgs_tree/meta/accessions.tsv

8.2 Clean old runs + rerun Roary

# 3) delete old roary runs (so you don't accidentally reuse old alignment)
rm -rf work_wgs_tree/roary_*

# 4) rerun Roary (fresh output dir)
mkdir -p work_wgs_tree/logs
ROARY_OUT="work_wgs_tree/roary_$(date +%s)"
roary -e --mafft -p 8 -cd 95 -i 95 \
  -f "$ROARY_OUT" \
  work_wgs_tree/gffs/*.gff \
  > work_wgs_tree/logs/roary_rerun.stdout.txt \
  2> work_wgs_tree/logs/roary_rerun.stderr.txt

8.3 Point to the new core alignment and rerun RAxML-NG

# 5) point meta file to new core alignment (absolute path)
echo "$(readlink -f "$ROARY_OUT/core_gene_alignment.aln")" > work_wgs_tree/meta/core_alignment_path.txt

# 6) rerun RAxML-NG
rm -rf work_wgs_tree/raxmlng
mkdir work_wgs_tree/raxmlng/
raxml-ng --all \
  --msa "$(cat work_wgs_tree/meta/core_alignment_path.txt)" \
  --model GTR+G \
  --bs-trees 1000 \
  --threads 8 \
  --prefix work_wgs_tree/raxmlng/core

8.4 Regenerate labels + replot

# 7) Run this to regenerate labels.tsv
bash regenerate_labels.sh

# 8) Manually correct the display names in work_wgs_tree/plot/labels.tsv (vim)
#Gibbsiella greigii USA56
#Gibbsiella papilionis PWX6
#Gibbsiella quercinecans strain FRB97
#Brenneria nigrifluens LMG 5956

# 9) Rerun only the plot step:
Rscript work_wgs_tree/plot/plot_tree.R \
  work_wgs_tree/raxmlng/core.raxml.support \
  work_wgs_tree/plot/labels.tsv \
  6 \
  work_wgs_tree/plot/core_tree.pdf \
  work_wgs_tree/plot/core_tree.png

9) fastANI + BUSCO explanations (recorded)

9.1 Reference FASTA inventory example

find . -name "*.fna"
#./work_wgs_tree/fastas/GCF_004342245.1.fna  GCF_004342245.1 Gibbsiella quercinecans DSM 25889 (GCF_004342245.1)
#./work_wgs_tree/fastas/GCF_039539505.1.fna  GCF_039539505.1 Gibbsiella papilionis PWX6 (GCF_039539505.1)
#./work_wgs_tree/fastas/GCF_005484965.1.fna  GCF_005484965.1 Brenneria nigrifluens LMG5956 (GCF_005484965.1)
#./work_wgs_tree/fastas/GCA_039540155.1.fna  GCA_039540155.1 Gibbsiella greigii USA56 (GCA_039540155.1)
#./work_wgs_tree/fastas/GE11174.fna
#./work_wgs_tree/fastas/GCF_002291425.1.fna  GCF_002291425.1 Gibbsiella quercinecans FRB97 (GCF_002291425.1)

9.2 fastANI runs

mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3

fastANI \
  -q GE11174.fasta \
  -r ./work_wgs_tree/fastas/GCF_004342245.1.fna \
  -o fastANI_out_Gibbsiella_quercinecans_DSM_25889.txt

fastANI \
  -q GE11174.fasta \
  -r ./work_wgs_tree/fastas/GCF_039539505.1.fna \
  -o fastANI_out_Gibbsiella_papilionis_PWX6.txt

fastANI \
  -q GE11174.fasta \
  -r ./work_wgs_tree/fastas/GCF_005484965.1.fna \
  -o fastANI_out_Brenneria_nigrifluens_LMG5956.txt

fastANI \
  -q GE11174.fasta \
  -r ./work_wgs_tree/fastas/GCA_039540155.1.fna \
  -o fastANI_out_Gibbsiella_greigii_USA56.txt

fastANI \
  -q GE11174.fasta \
  -r ./work_wgs_tree/fastas/GCF_002291425.1.fna \
  -o fastANI_out_Gibbsiella_quercinecans_FRB97.txt

cat fastANI_out_*.txt > fastANI_out.txt

9.3 fastANI output table (recorded)

GE11174.fasta   ./work_wgs_tree/fastas/GCF_005484965.1.fna      79.1194 597     1890
GE11174.fasta   ./work_wgs_tree/fastas/GCA_039540155.1.fna      95.9589 1547    1890
GE11174.fasta   ./work_wgs_tree/fastas/GCF_039539505.1.fna      97.2172 1588    1890
GE11174.fasta   ./work_wgs_tree/fastas/GCF_004342245.1.fna      98.0889 1599    1890
GE11174.fasta   ./work_wgs_tree/fastas/GCF_002291425.1.fna      98.1285 1622    1890
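A ranked view of the same table (sort by ANI, descending):

sort -k3,3gr fastANI_out.txt | awk '{printf "%s\t%.2f\t%d/%d\n", $2, $3, $4, $5}'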

9.4 Species boundary note (recorded; translated from Chinese)

In bacterial genome comparison, a commonly used empirical threshold is:

  • ANI ≥ 95–96%: usually taken to mean the two genomes belong to the same species
  • The 97.09% here → very likely An6 and HITLi7 belong to the same species, but possibly not the same strain, since some divergence remains.

Whether two isolates are “the same strain” usually also requires:

  • core-genome SNP distance, cgMLST
  • assembly quality / contamination checks
  • sufficiently high coverage

9.5 BUSCO results interpretation (recorded)

Quick interpretation of the BUSCO results (translated from Chinese). These numbers are already included in Table 1.

  • Complete 99.2%, Missing 0.0%: the assembly is very complete (excellent for a bacterial genome)
  • Duplicated 0.0%: few duplicated copies → lower risk of contamination or sample mixing
  • Scaffolds 80, N50 ~169 kb: moderately fragmented, but overall quality is more than sufficient for ANI / species identification

10) fastANI explanation (recorded narrative)

From your tree and the fastANI table, GE11174 is clearly inside the Gibbsiella quercinecans clade, and far from the outgroup (Brenneria nigrifluens). The ANI values quantify that same pattern.

1) Outgroup check (sanity)

  • GE11174 vs Brenneria nigrifluens (GCF_005484965.1): ANI 79.12% (597/1890 fragments)

    • 79% ANI is way below any species boundary → not the same genus/species.
    • On the tree, Brenneria sits on a long branch as the outgroup, consistent with this deep divergence.
    • The relatively low matched fragments (597/1890) also fits “distant genomes” (fewer orthologous regions pass the ANI mapping filters).

2) Species-level placement of GE11174

The rule of thumb quoted in 9.4 holds here: ANI ≥ 95–96% ⇒ same species.

Compare GE11174 to the Gibbsiella references:

  • vs GCA_039540155.1 (Gibbsiella greigii USA56): 95.96% (1547/1890)

    • Right at the boundary. This suggests “close but could be different species” or “taxonomy/labels may not reflect true species boundaries” depending on how those genomes are annotated.
    • On the tree, G. greigii is outside the quercinecans group but not hugely far, which matches “borderline ANI”.
  • vs GCF_039539505.1 (Gibbsiella papilionis PWX6): 97.22% (1588/1890)

  • vs GCF_004342245.1 (G. quercinecans DSM 25889): 98.09% (1599/1890)

  • vs GCF_002291425.1 (G. quercinecans FRB97): 98.13% (1622/1890)

These are all comfortably above 96%, especially the two quercinecans genomes (~98.1%). That strongly supports:

GE11174 belongs to the same species as Gibbsiella quercinecans (and is closer to quercinecans references than to greigii).

3) Closest reference and “same strain?” question

GE11174’s closest by ANI in your list is:

  • FRB97 (GCF_002291425.1): 98.1285%
  • DSM 25889 (GCF_004342245.1): 98.0889%
  • Next: PWX6 97.2172%

These differences are small, but 98.1% ANI is not “same strain” evidence by itself. Within a species, different strains commonly sit anywhere from ~96–99.9% ANI depending on diversity. To claim “same strain / very recent transmission,” people usually look for much tighter genome-wide similarity:

  • core-genome SNP distance (often single digits to tens, depending on organism and context)
  • cgMLST allele differences
  • recombination filtering (if relevant)
  • assembly QC/contamination checks
  • and confirming that alignment/ANI coverage is high and not biased by missing regions

Your fragment matches (e.g., 1622/1890 for FRB97) are reasonably high, supporting that the ANI is meaningful, but it still doesn’t equate to “same strain.”

4) How to phrase the combined interpretation (tree + ANI)

A clear summary you can use:

  • The phylogenetic tree places GE11174 within the Gibbsiella quercinecans lineage, with Brenneria nigrifluens as a distant outgroup.
  • fastANI supports this:

    • ~98.1% ANI to G. quercinecans FRB97 and DSM 25889 → strong same-species support.
    • 97.2% to G. papilionis → still same-species range, but more distant than the quercinecans references.
    • 95.96% to G. greigii → borderline; consistent with being a close neighboring lineage but not the best species match for GE11174.
    • 79.1% to Brenneria → confirms it is an appropriate outgroup and far outside the species/genus boundary.
  • Therefore, GE11174 is very likely Gibbsiella quercinecans (species-level), and appears most similar to FRB97/DSM 25889, but additional high-resolution analyses are required to assess “same strain.”

For a tighter strain-level conclusion, run fastANI against a broader reference set and/or derive core-genome SNP distances from the Roary alignment.


11) Compare to the next closest isolate (pairwise alignment) — nf-core/pairgenomealign

conda deactivate

nextflow run nf-core/pairgenomealign -r 2.2.2 -profile docker \
  --target GE11174.fasta \
  --input samplesheet.csv \
  --outdir pairgenomealign_out \
  --igenomes_base /mnt/nvme1n1p1/igenomes \
  --genome GRCh38

Recorded TODO:

#TODO_NEXT_MONDAY: * phylogenetic tree + fastANI + nf-core/pairgenomealign (compare to the closest isolate https://nf-co.re/pairgenomealign/2.2.1/)

  • summarize all results in a mail and send them back, mentioning that we can submit the genome to NCBI to obtain a high-quality annotation. What strain name would you like to assign to this isolate?
  • If they agree, I can submit the two new isolates to the NCBI-database!

12) Submit both sequences in a batch to NCBI-server! (planned step)

Recorded as:

  1. submit both sequences in a batch to NCBI-server!

13) Find the closest isolate from GenBank (robust approach) for STEP_7

13.1 Download all available Gibbsiella genomes

# download all available genomes for the genus Gibbsiella (includes assemblies + metadata)
# --assembly-level: must be 'chromosome', 'complete', 'contig', 'scaffold'
datasets download genome taxon Gibbsiella --include genome,gff3,gbff --assembly-level complete,chromosome,scaffold --filename gibbsiella.zip
unzip -q gibbsiella.zip -d gibbsiella_ncbi

mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3

13.2 Mash sketching + nearest neighbors

# make a Mash sketch of your isolate
mash sketch -o isolate bacass_out/Unicycler/GE11174.scaffolds.fa

# sketch all reference genomes (example path—adjust)
find gibbsiella_ncbi -name "*.fna" -o -name "*.fasta" > refs.txt
mash sketch -o refs -l refs.txt

# get closest genomes
mash dist isolate.msh refs.msh | sort -gk3 | head -n 20 > top20_mash.txt
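Each mash dist line holds the two genome IDs followed by distance, p-value, and shared-hashes; a compact view of the top hits:

awk 'BEGIN{OFS="\t"} {print $2, $3, $5}' top20_mash.txt | column -t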

Recorded interpretation:

  • Best hits to GE11174.scaffolds.fa are:

    • GCA/GCF_002291425.1 (GenBank + RefSeq copies of the same assembly)
    • GCA/GCF_004342245.1 (same duplication pattern)
    • GCA/GCF_047901425.1 (FRB97; also duplicated)
  • Mash distances of ~0.018–0.020 are very close (well within the typical same-species range).
  • p-values display as 0 due to numeric underflow (i.e., extremely significant).

13.3 Remove duplicates (GCA vs GCF)

Goal: keep one of each duplicated assembly (prefer GCF if available).

Example snippet recorded:

# Take your top hits, prefer GCF over GCA
cat top20_mash.txt \
  | awk '{print $2}' \
  | sed 's|/GCA_.*||; s|/GCF_.*||' \
  | sort -u

Manual suggestion recorded:

  • keep GCF_002291425.1 (drop GCA_002291425.1)
  • keep GCF_004342245.1
  • keep GCF_047901425.1
  • optionally keep GCA_032062225.1 if it’s truly different and you want a more distant ingroup point
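A scripted version of this dedup (a sketch; it assumes accessions appear in the Mash output paths as GC[AF]_XXXXXXXXX.N and prefers GCF over GCA for the same assembly number):

grep -oE 'GC[AF]_[0-9]+\.[0-9]+' top20_mash.txt | sort -u \
  | awk -F'[_.]' '{ key=$2; isGCF=($1=="GCF")
                    if (!(key in best) || isGCF > rank[key]) { best[key]=$0; rank[key]=isGCF } }
                  END { for (k in best) print best[k] }'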

Appendix — Complete attached code (standalone)

Below are the full contents of the attached scripts exactly as provided, so this post can be used standalone in the future.

Note: scripts referenced above but not reproduced here (e.g., run_resistome_virulome*.sh, export_table1_stats_to_excel_py36_compat.py) should be appended in the same format to keep the post fully standalone.


File: build_wgs_tree_fig3B.sh

#!/usr/bin/env bash
set -euo pipefail

# build_wgs_tree_fig3B.sh
#
# Purpose:
#   Build a core-genome phylogenetic tree and a publication-style plot similar to Fig 3B.
#
# Usage:
#   ./build_wgs_tree_fig3B.sh            # full run
#   ./build_wgs_tree_fig3B.sh plot-only  # only regenerate the plot from existing outputs
#
# Requirements:
#   - Conda env with required tools. Set ENV_NAME to conda env path.
#   - NCBI datasets and/or Entrez usage requires NCBI_EMAIL.
#   - Roary, Prokka, RAxML-NG, MAFFT, R packages for plotting.
#
# Environment variables:
#   ENV_NAME      : path to conda env (e.g., /home/jhuang/miniconda3/envs/bengal3_ac3)
#   NCBI_EMAIL    : email for Entrez calls
#   THREADS       : default threads
#
# Inputs:
#   - targets.tsv: list of target accessions (if used in resolve step)
#   - local isolate genome fasta
#
# Outputs:
#   work_wgs_tree/...
#
# NOTE:
#   If plotting packages are missing in ENV_NAME, run plot-only under an R-capable env (e.g., r_env).

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
MODE="${1:-full}"

THREADS="${THREADS:-8}"
WORKDIR="${WORKDIR:-work_wgs_tree}"

# Activate conda env if provided
if [[ -n "${ENV_NAME:-}" ]]; then
  # shellcheck disable=SC1090
  source "$(conda info --base)/etc/profile.d/conda.sh"
  conda activate "${ENV_NAME}"
fi

mkdir -p "${WORKDIR}"
mkdir -p "${WORKDIR}/logs"

log() {
  echo "[$(date '+%F %T')] $*" >&2
}

# ------------------------------------------------------------------------------
# Helper: check command exists
need_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "ERROR: required command '$1' not found in PATH" >&2
    exit 1
  }
}

# ------------------------------------------------------------------------------
# Tool checks (plot-only skips some)
if [[ "${MODE}" != "plot-only" ]]; then
  need_cmd python
  need_cmd roary
  need_cmd raxml-ng
  need_cmd prokka
  need_cmd mafft
  need_cmd awk
  need_cmd sed
  need_cmd grep
fi

need_cmd Rscript

# ------------------------------------------------------------------------------
# Paths
META_DIR="${WORKDIR}/meta"
GENOMES_DIR="${WORKDIR}/genomes_ncbi"
FASTAS_DIR="${WORKDIR}/fastas"
GFFS_DIR="${WORKDIR}/gffs"
PROKKA_DIR="${WORKDIR}/prokka"
ROARY_DIR="${WORKDIR}/roary"
RAXML_DIR="${WORKDIR}/raxmlng"
PLOT_DIR="${WORKDIR}/plot"

mkdir -p "${META_DIR}" "${GENOMES_DIR}" "${FASTAS_DIR}" "${GFFS_DIR}" "${PROKKA_DIR}" "${ROARY_DIR}" "${RAXML_DIR}" "${PLOT_DIR}"

ACCESSIONS_TSV="${META_DIR}/accessions.tsv"
LABELS_TSV="${PLOT_DIR}/labels.tsv"
CORE_ALIGN_PATH_FILE="${META_DIR}/core_alignment_path.txt"

# ------------------------------------------------------------------------------
# Step: plot only
if [[ "${MODE}" == "plot-only" ]]; then
  log "Running in plot-only mode..."

  # If labels file isn't present, try generating a minimal one
  if [[ ! -s "${LABELS_TSV}" ]]; then
    log "labels.tsv not found. Creating a placeholder labels.tsv (edit as needed)."
    {
      echo -e "accession\tdisplay"
      if [[ -d "${FASTAS_DIR}" ]]; then
        # NOTE: unmatched globs are filtered by the -e test below (a "2>/dev/null"
        # inside a for-list would be a literal word, not a redirection)
        for f in "${FASTAS_DIR}"/*.fna "${FASTAS_DIR}"/*.fa "${FASTAS_DIR}"/*.fasta; do
          [[ -e "$f" ]] || continue
          bn="$(basename "$f")"
          acc="${bn%%.*}"
          echo -e "${acc}\t${acc}"
        done
      fi
    } > "${LABELS_TSV}"
  fi

  # Plot using plot_tree_v4.R if present; otherwise fall back to plot_tree.R
  PLOT_SCRIPT="${PLOT_DIR}/plot_tree_v4.R"
  if [[ ! -f "${PLOT_SCRIPT}" ]]; then
    PLOT_SCRIPT="${SCRIPT_DIR}/plot_tree_v4.R"
  fi

  if [[ ! -f "${PLOT_SCRIPT}" ]]; then
    echo "ERROR: plot_tree_v4.R not found" >&2
    exit 1
  fi

  SUPPORT_FILE="${RAXML_DIR}/core.raxml.support"
  if [[ ! -f "${SUPPORT_FILE}" ]]; then
    echo "ERROR: Support file not found: ${SUPPORT_FILE}" >&2
    exit 1
  fi

  OUTPDF="${PLOT_DIR}/core_tree.pdf"
  OUTPNG="${PLOT_DIR}/core_tree.png"
  ROOT_N=6

  log "Plotting tree..."
  Rscript "${PLOT_SCRIPT}" \
    "${SUPPORT_FILE}" \
    "${LABELS_TSV}" \
    "${ROOT_N}" \
    "${OUTPDF}" \
    "${OUTPNG}"

  log "Done (plot-only). Outputs: ${OUTPDF} ${OUTPNG}"
  exit 0
fi

# ------------------------------------------------------------------------------
# Full pipeline
log "Running full pipeline..."

# ------------------------------------------------------------------------------
# Config / expected inputs
TARGETS_TSV="${SCRIPT_DIR}/targets.tsv"
RESOLVED_TSV="${SCRIPT_DIR}/resolved_accessions.tsv"
ISOLATE_FASTA="${SCRIPT_DIR}/GE11174.fasta"

# If caller has different locations, let them override
TARGETS_TSV="${TARGETS_TSV_OVERRIDE:-${TARGETS_TSV}}"
RESOLVED_TSV="${RESOLVED_TSV_OVERRIDE:-${RESOLVED_TSV}}"
ISOLATE_FASTA="${ISOLATE_FASTA_OVERRIDE:-${ISOLATE_FASTA}}"

# ------------------------------------------------------------------------------
# Step 1: Resolve best assemblies (if targets.tsv exists)
if [[ -f "${TARGETS_TSV}" ]]; then
  log "Resolving best assemblies from targets.tsv..."
  if [[ -z "${NCBI_EMAIL:-}" ]]; then
    echo "ERROR: NCBI_EMAIL is required for Entrez calls" >&2
    exit 1
  fi
  python "${SCRIPT_DIR}/resolve_best_assemblies_entrez.py" "${TARGETS_TSV}" "${RESOLVED_TSV}"
else
  log "targets.tsv not found; assuming resolved_accessions.tsv already exists"
fi

if [[ ! -s "${RESOLVED_TSV}" ]]; then
  echo "ERROR: resolved_accessions.tsv not found or empty: ${RESOLVED_TSV}" >&2
  exit 1
fi

# ------------------------------------------------------------------------------
# Step 2: Prepare accessions.tsv for downstream steps
log "Preparing accessions.tsv..."
{
  echo -e "label\taccession"
  awk -F'\t' 'NR>1 {print $1"\t"$2}' "${RESOLVED_TSV}"
} > "${ACCESSIONS_TSV}"

# ------------------------------------------------------------------------------
# Step 3: Download genomes (NCBI datasets if available)
log "Downloading genomes (if needed)..."
need_cmd datasets
need_cmd unzip

while IFS=$'\t' read -r label acc; do
  [[ "${label}" == "label" ]] && continue
  [[ -z "${acc}" ]] && continue

  OUTZIP="${GENOMES_DIR}/${acc}.zip"
  OUTDIR="${GENOMES_DIR}/${acc}"

  if [[ -d "${OUTDIR}" ]]; then
    log "  ${acc}: already downloaded"
    continue
  fi

  log "  ${acc}: downloading..."
  datasets download genome accession "${acc}" --include genome --filename "${OUTZIP}" \
    > "${WORKDIR}/logs/datasets_${acc}.stdout.txt" \
    2> "${WORKDIR}/logs/datasets_${acc}.stderr.txt" || {
      echo "ERROR: datasets download failed for ${acc}. See logs." >&2
      exit 1
    }

  mkdir -p "${OUTDIR}"
  unzip -q "${OUTZIP}" -d "${OUTDIR}"
done < "${ACCESSIONS_TSV}"

# ------------------------------------------------------------------------------
# Step 4: Collect FASTA files
log "Collecting FASTA files..."
rm -f "${FASTAS_DIR}"/* 2>/dev/null || true

while IFS=$'\t' read -r label acc; do
  [[ "${label}" == "label" ]] && continue
  OUTDIR="${GENOMES_DIR}/${acc}"
  fna="$(find "${OUTDIR}" -name "*.fna" | head -n 1 || true)"
  if [[ -z "${fna}" ]]; then
    echo "ERROR: no .fna found for ${acc} in ${OUTDIR}" >&2
    exit 1
  fi
  cp -f "${fna}" "${FASTAS_DIR}/${acc}.fna"
done < "${ACCESSIONS_TSV}"

# Add isolate
if [[ -f "${ISOLATE_FASTA}" ]]; then
  cp -f "${ISOLATE_FASTA}" "${FASTAS_DIR}/GE11174.fna"
else
  log "WARNING: isolate fasta not found at ${ISOLATE_FASTA}; skipping"
fi

# ------------------------------------------------------------------------------
# Step 5: Run Prokka on each genome
log "Running Prokka..."
for f in "${FASTAS_DIR}"/*.fna; do
  bn="$(basename "${f}")"
  acc="${bn%.fna}"
  out="${PROKKA_DIR}/${acc}"

  if [[ -d "${out}" && -s "${out}/${acc}.gff" ]]; then
    log "  ${acc}: prokka output exists"
    continue
  fi

  mkdir -p "${out}"
  log "  ${acc}: prokka..."
  prokka --outdir "${out}" --prefix "${acc}" --cpus "${THREADS}" "${f}" \
    > "${WORKDIR}/logs/prokka_${acc}.stdout.txt" \
    2> "${WORKDIR}/logs/prokka_${acc}.stderr.txt"
done

# ------------------------------------------------------------------------------
# Step 6: Collect GFFs for Roary
log "Collecting GFFs..."
rm -f "${GFFS_DIR}"/*.gff 2>/dev/null || true
for d in "${PROKKA_DIR}"/*; do
  [[ -d "${d}" ]] || continue
  acc="$(basename "${d}")"
  gff="${d}/${acc}.gff"
  if [[ -f "${gff}" ]]; then
    cp -f "${gff}" "${GFFS_DIR}/${acc}.gff"
  else
    log "WARNING: missing GFF for ${acc}"
  fi
done

# ------------------------------------------------------------------------------
# Step 7: Roary
log "Running Roary..."
ROARY_OUT="${WORKDIR}/roary_$(date +%s)"
mkdir -p "${ROARY_OUT}"
roary -e --mafft -p "${THREADS}" -cd 95 -i 95 \
  -f "${ROARY_OUT}" \
  "${GFFS_DIR}"/*.gff \
  > "${WORKDIR}/logs/roary.stdout.txt" \
  2> "${WORKDIR}/logs/roary.stderr.txt"

CORE_ALN="${ROARY_OUT}/core_gene_alignment.aln"
if [[ ! -f "${CORE_ALN}" ]]; then
  echo "ERROR: core alignment not found: ${CORE_ALN}" >&2
  exit 1
fi
readlink -f "${CORE_ALN}" > "${CORE_ALIGN_PATH_FILE}"

# ------------------------------------------------------------------------------
# Step 8: RAxML-NG
log "Running RAxML-NG..."
rm -rf "${RAXML_DIR}"
mkdir -p "${RAXML_DIR}"
raxml-ng --all \
  --msa "$(cat "${CORE_ALIGN_PATH_FILE}")" \
  --model GTR+G \
  --bs-trees 1000 \
  --threads "${THREADS}" \
  --prefix "${RAXML_DIR}/core" \
  > "${WORKDIR}/logs/raxml.stdout.txt" \
  2> "${WORKDIR}/logs/raxml.stderr.txt"

SUPPORT_FILE="${RAXML_DIR}/core.raxml.support"
if [[ ! -f "${SUPPORT_FILE}" ]]; then
  echo "ERROR: RAxML support file not found: ${SUPPORT_FILE}" >&2
  exit 1
fi

# ------------------------------------------------------------------------------
# Step 9: Generate labels.tsv (basic)
log "Generating labels.tsv..."
{
  echo -e "accession\tdisplay"
  echo -e "GE11174\tGE11174"
  while IFS=$'\t' read -r label acc; do
    [[ "${label}" == "label" ]] && continue
    echo -e "${acc}\t${label} (${acc})"
  done < "${ACCESSIONS_TSV}"
} > "${LABELS_TSV}"

log "NOTE: You may want to manually edit ${LABELS_TSV} for publication display names."

# ------------------------------------------------------------------------------
# Step 10: Plot
log "Plotting..."
PLOT_SCRIPT="${SCRIPT_DIR}/plot_tree_v4.R"
OUTPDF="${PLOT_DIR}/core_tree.pdf"
OUTPNG="${PLOT_DIR}/core_tree.png"
ROOT_N=6

Rscript "${PLOT_SCRIPT}" \
  "${SUPPORT_FILE}" \
  "${LABELS_TSV}" \
  "${ROOT_N}" \
  "${OUTPDF}" \
  "${OUTPNG}" \
  > "${WORKDIR}/logs/plot.stdout.txt" \
  2> "${WORKDIR}/logs/plot.stderr.txt"

log "Done. Outputs:"
log "  Tree support: ${SUPPORT_FILE}"
log "  Labels:       ${LABELS_TSV}"
log "  Plot PDF:     ${OUTPDF}"
log "  Plot PNG:     ${OUTPNG}"

File: resolve_best_assemblies_entrez.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
resolve_best_assemblies_entrez.py

Resolve a "best" assembly accession for a list of target taxa / accessions using NCBI Entrez.

Usage:
  ./resolve_best_assemblies_entrez.py targets.tsv resolved_accessions.tsv

Input (targets.tsv):
  TSV with at least columns:
    label <TAB> query
  Where "query" can be an organism name, taxid, or an assembly/accession hint.

Output (resolved_accessions.tsv):
  TSV with columns:
    label <TAB> accession <TAB> organism <TAB> assembly_name <TAB> assembly_level <TAB> refseq_category

Requires:
  - BioPython (Entrez)
  - NCBI_EMAIL environment variable (or set in script)
"""

import os
import sys
import time
import csv
from typing import Dict, List, Optional, Tuple

try:
    from Bio import Entrez
except ImportError:
    sys.stderr.write("ERROR: Biopython is required (Bio.Entrez)\n")
    sys.exit(1)

def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

def read_targets(path: str) -> List[Tuple[str, str]]:
    rows: List[Tuple[str, str]] = []
    with open(path, "r", newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        for i, r in enumerate(reader, start=1):
            if not r:
                continue
            if i == 1 and r[0].lower() in ("label", "name"):
                # header
                continue
            if len(r) < 2:
                continue
            label = r[0].strip()
            query = r[1].strip()
            if label and query:
                rows.append((label, query))
    return rows

def entrez_search(db: str, term: str, retmax: int = 20) -> List[str]:
    handle = Entrez.esearch(db=db, term=term, retmax=retmax)
    res = Entrez.read(handle)
    handle.close()
    return res.get("IdList", [])

def entrez_summary(db: str, ids: List[str]) -> List[Dict]:
    if not ids:
        return []
    handle = Entrez.esummary(db=db, id=",".join(ids), retmode="xml")
    res = Entrez.read(handle)
    handle.close()
    # For the assembly database, esummary wraps the records in a
    # DocumentSummarySet; unwrap it so callers always get a plain list.
    if isinstance(res, dict) and "DocumentSummarySet" in res:
        return res["DocumentSummarySet"].get("DocumentSummary", [])
    return res

def pick_best_assembly(summaries) -> Optional[Dict]:
    """
    Heuristics:
      Prefer RefSeq (refseq_category != 'na'), prefer higher assembly level:
        complete genome > chromosome > scaffold > contig
      Then prefer latest / highest quality where possible.
    """
    if not summaries:
        return None

    level_rank = {
        "Complete Genome": 4,
        "Chromosome": 3,
        "Scaffold": 2,
        "Contig": 1
    }

    def score(s: Dict) -> Tuple[int, int, int]:
        refcat = s.get("RefSeq_category", "na")
        is_refseq = 1 if (refcat and refcat.lower() != "na") else 0
        level = s.get("AssemblyStatus", "")
        lvl = level_rank.get(level, 0)
        # Prefer latest submit date (YYYY/MM/DD)
        submit = s.get("SubmissionDate", "0000/00/00")
        try:
            y, m, d = submit.split("/")
            date_int = int(y) * 10000 + int(m) * 100 + int(d)
        except Exception:
            date_int = 0
        return (is_refseq, lvl, date_int)

    best = max(summaries, key=score)
    return best

def resolve_query(label: str, query: str) -> Optional[Dict]:
    # If query looks like an assembly accession, search directly.
    term = query
    if query.startswith("GCA_") or query.startswith("GCF_"):
        term = f"{query}[Assembly Accession]"

    ids = entrez_search(db="assembly", term=term, retmax=50)
    if not ids:
        # Try organism name search
        term2 = f"{query}[Organism]"
        ids = entrez_search(db="assembly", term=term2, retmax=50)

    if not ids:
        eprint(f"WARNING: no assembly hits for {label} / {query}")
        return None

    summaries = entrez_summary(db="assembly", ids=ids)
    best = pick_best_assembly(summaries)
    if not best:
        eprint(f"WARNING: could not pick best assembly for {label} / {query}")
        return None

    # Extract useful fields
    acc = best.get("AssemblyAccession", "")
    org = best.get("Organism", "")
    name = best.get("AssemblyName", "")
    level = best.get("AssemblyStatus", "")
    refcat = best.get("RefSeq_category", "")

    return {
        "label": label,
        "accession": acc,
        "organism": org,
        "assembly_name": name,
        "assembly_level": level,
        "refseq_category": refcat
    }

def main():
    if len(sys.argv) != 3:
        eprint("Usage: resolve_best_assemblies_entrez.py targets.tsv resolved_accessions.tsv")
        sys.exit(1)

    targets_path = sys.argv[1]
    out_path = sys.argv[2]

    email = os.environ.get("NCBI_EMAIL") or os.environ.get("ENTREZ_EMAIL")
    if not email:
        eprint("ERROR: please set NCBI_EMAIL environment variable (e.g., export NCBI_EMAIL='you@domain')")
        sys.exit(1)
    Entrez.email = email

    targets = read_targets(targets_path)
    if not targets:
        eprint("ERROR: no targets found in input TSV")
        sys.exit(1)

    out_rows: List[Dict] = []
    for label, query in targets:
        eprint(f"Resolving: {label}\t{query}")
        res = resolve_query(label, query)
        if res:
            out_rows.append(res)
        time.sleep(0.34)  # be nice to NCBI

    with open(out_path, "w", newline="") as fh:
        w = csv.writer(fh, delimiter="\t")
        w.writerow(["label", "accession", "organism", "assembly_name", "assembly_level", "refseq_category"])
        for r in out_rows:
            w.writerow([
                r.get("label", ""),
                r.get("accession", ""),
                r.get("organism", ""),
                r.get("assembly_name", ""),
                r.get("assembly_level", ""),
                r.get("refseq_category", "")
            ])

    eprint(f"Wrote: {out_path} ({len(out_rows)} rows)")

if __name__ == "__main__":
    main()

File: make_table1_GE11174.sh

#!/usr/bin/env bash
set -euo pipefail

# make_table1_GE11174.sh
#
# Generate a "Table 1" summary for sample GE11174:
# - sequencing summary (reads, mean length, etc.)
# - basic assembly stats
# - placeholders for BUSCO, N50, etc. (extend the output TSV as needed)
#
# Expects to be run with:
#   ENV_NAME=gunc_env AUTO_INSTALL=1 THREADS=32 ~/Scripts/make_table1_GE11174.sh
#
# This script writes work products to:
#   table1_GE11174_work/

SAMPLE="${SAMPLE:-GE11174}"
THREADS="${THREADS:-8}"
WORKDIR="${WORKDIR:-table1_${SAMPLE}_work}"

AUTO_INSTALL="${AUTO_INSTALL:-0}"
ENV_NAME="${ENV_NAME:-}"

log() {
  echo "[$(date '+%F %T')] $*" >&2
}

# Activate conda env if requested
if [[ -n "${ENV_NAME}" ]]; then
  # shellcheck disable=SC1090
  source "$(conda info --base)/etc/profile.d/conda.sh"
  conda activate "${ENV_NAME}"
fi

mkdir -p "${WORKDIR}"
mkdir -p "${WORKDIR}/logs"

# ------------------------------------------------------------------------------
# Basic tool checks
need_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "ERROR: required command '$1' not found in PATH" >&2
    exit 1
  }
}

need_cmd awk
need_cmd grep
need_cmd sed
need_cmd wc
need_cmd python

# Optional tools
if command -v seqkit >/dev/null 2>&1; then
  HAVE_SEQKIT=1
else
  HAVE_SEQKIT=0
fi

if command -v pigz >/dev/null 2>&1; then
  HAVE_PIGZ=1
else
  HAVE_PIGZ=0
fi

# ------------------------------------------------------------------------------
# Inputs
RAWREADS="${RAWREADS:-${SAMPLE}.rawreads.fastq.gz}"
ASM_FASTA="${ASM_FASTA:-${SAMPLE}.fasta}"

if [[ ! -f "${RAWREADS}" ]]; then
  log "WARNING: raw reads file not found: ${RAWREADS}"
fi

if [[ ! -f "${ASM_FASTA}" ]]; then
  log "WARNING: assembly fasta not found: ${ASM_FASTA}"
fi

# ------------------------------------------------------------------------------
# Sequencing summary
log "Computing sequencing summary..."
READS_N="NA"
MEAN_LEN="NA"

if [[ -f "${RAWREADS}" ]]; then
  if [[ "${HAVE_PIGZ}" -eq 1 ]]; then
    READS_N="$(pigz -dc "${RAWREADS}" | awk 'END{print NR/4}')"
  else
    READS_N="$(gzip -dc "${RAWREADS}" | awk 'END{print NR/4}')"
  fi

  if [[ "${HAVE_SEQKIT}" -eq 1 ]]; then
    # Parse seqkit stats output; -T gives plain tab-separated numbers
    # (no thousands separators), and avg_len is column 7.
    MEAN_LEN="$(seqkit stats -T "${RAWREADS}" | awk 'NR==2{print $7}')"
  fi
fi

# ------------------------------------------------------------------------------
# Assembly stats (simple)
log "Computing assembly stats..."
ASM_SIZE="NA"
ASM_CONTIGS="NA"

if [[ -f "${ASM_FASTA}" ]]; then
  # Count contigs and sum length
  ASM_CONTIGS="$(grep -c '^>' "${ASM_FASTA}" || true)"
  ASM_SIZE="$(grep -v '^>' "${ASM_FASTA}" | tr -d '\n' | wc -c | awk '{print $1}')"
fi

# ------------------------------------------------------------------------------
# Output a basic TSV summary (can be expanded)
OUT_TSV="${WORKDIR}/table1_${SAMPLE}.tsv"
{
  echo -e "sample\treads_total\tmean_read_length_bp\tassembly_contigs\tassembly_size_bp"
  echo -e "${SAMPLE}\t${READS_N}\t${MEAN_LEN}\t${ASM_CONTIGS}\t${ASM_SIZE}"
} > "${OUT_TSV}"

log "Wrote: ${OUT_TSV}"

File: export_table1_stats_to_excel_py36_compat.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Export a comprehensive Excel workbook from a Table1 pipeline workdir.
Python 3.6 compatible (no PEP604 unions, no builtin generics).
Requires: openpyxl

Sheets (as available):
- Summary
- Table1 (if Table1_*.tsv exists)
- QUAST_report (report.tsv)
- QUAST_metrics (metric/value)
- Mosdepth_summary (*.mosdepth.summary.txt)
- CheckM (checkm_summary.tsv)
- GUNC_* (all .tsv under gunc/out)
- File_Inventory (relative path, size, mtime; optional md5 for small files)
- Run_log_preview (head/tail of latest log under workdir/logs or workdir/*/logs)
"""

from __future__ import print_function

import argparse
import csv
import hashlib
import os
import re
import sys
import time
from pathlib import Path

try:
    from openpyxl import Workbook
    from openpyxl.utils import get_column_letter
except ImportError:
    sys.stderr.write("ERROR: openpyxl is required. Install with:\n"
                     "  conda install -c conda-forge openpyxl\n")
    raise

MAX_XLSX_ROWS = 1048576

def safe_sheet_name(name, used):
    # Excel: <=31 chars, cannot contain: : \ / ? * [ ]
    bad = r'[:\\/?*\[\]]'
    base = name.strip() or "Sheet"
    base = re.sub(bad, "_", base)
    base = base[:31]
    if base not in used:
        used.add(base)
        return base
    # make unique with suffix
    for i in range(2, 1000):
        suffix = "_%d" % i
        cut = 31 - len(suffix)
        candidate = (base[:cut] + suffix)
        if candidate not in used:
            used.add(candidate)
            return candidate
    raise RuntimeError("Too many duplicate sheet names for base=%s" % base)

def autosize(ws, max_width=60):
    for col in ws.columns:
        max_len = 0
        col_letter = get_column_letter(col[0].column)
        for cell in col:
            v = cell.value
            if v is None:
                continue
            s = str(v)
            if len(s) > max_len:
                max_len = len(s)
        ws.column_dimensions[col_letter].width = min(max_width, max(10, max_len + 2))

def write_table(ws, header, rows, max_rows=None):
    if header:
        ws.append(header)
    count = 0
    for r in rows:
        ws.append(r)
        count += 1
        if max_rows is not None and count >= max_rows:
            break

def read_tsv(path, max_rows=None):
    header = []
    rows = []
    with path.open("r", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        for i, r in enumerate(reader):
            if i == 0:
                header = r
                continue
            rows.append(r)
            if max_rows is not None and len(rows) >= max_rows:
                break
    return header, rows

def read_text_table(path, max_rows=None):
    # for mosdepth summary (tsv with header)
    return read_tsv(path, max_rows=max_rows)

def md5_file(path, chunk=1024*1024):
    h = hashlib.md5()
    with path.open("rb") as f:
        while True:
            b = f.read(chunk)
            if not b:
                break
            h.update(b)
    return h.hexdigest()

def find_latest_log(workdir):
    candidates = []
    # common locations
    for p in [workdir / "logs", workdir / "log", workdir / "Logs"]:
        if p.exists():
            candidates.extend(p.glob("*.log"))
    # nested logs
    candidates.extend(workdir.glob("**/logs/*.log"))
    if not candidates:
        return None
    candidates.sort(key=lambda x: x.stat().st_mtime, reverse=True)
    return candidates[0]

def add_summary_sheet(wb, used, info_items):
    ws = wb.create_sheet(title=safe_sheet_name("Summary", used))
    ws.append(["Key", "Value"])
    for k, v in info_items:
        ws.append([k, v])
    autosize(ws)

def add_log_preview(wb, used, log_path, head_n=80, tail_n=120):
    if log_path is None or not log_path.exists():
        return
    ws = wb.create_sheet(title=safe_sheet_name("Run_log_preview", used))
    ws.append(["Log path", str(log_path)])
    ws.append([])
    lines = log_path.read_text(errors="replace").splitlines()
    ws.append(["--- HEAD (%d) ---" % head_n])
    for line in lines[:head_n]:
        ws.append([line])
    ws.append([])
    ws.append(["--- TAIL (%d) ---" % tail_n])
    for line in lines[-tail_n:]:
        ws.append([line])
    ws.column_dimensions["A"].width = 120

def add_file_inventory(wb, used, workdir, do_md5=True, md5_max_bytes=200*1024*1024, max_rows=None):
    ws = wb.create_sheet(title=safe_sheet_name("File_Inventory", used))
    ws.append(["relative_path", "size_bytes", "mtime_iso", "md5(optional)"])
    count = 0
    for p in sorted(workdir.rglob("*")):
        if p.is_dir():
            continue
        rel = str(p.relative_to(workdir))
        st = p.stat()
        mtime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(st.st_mtime))
        md5 = ""
        if do_md5 and st.st_size <= md5_max_bytes:
            try:
                md5 = md5_file(p)
            except Exception:
                md5 = "ERROR"
        ws.append([rel, st.st_size, mtime, md5])
        count += 1
        if max_rows is not None and count >= max_rows:
            break
    autosize(ws, max_width=80)

def add_tsv_sheet(wb, used, name, path, max_rows=None):
    header, rows = read_tsv(path, max_rows=max_rows)
    ws = wb.create_sheet(title=safe_sheet_name(name, used))
    write_table(ws, header, rows, max_rows=max_rows)
    autosize(ws, max_width=80)

def add_quast_metrics_sheet(wb, used, quast_report_tsv):
    header, rows = read_tsv(quast_report_tsv, max_rows=None)
    if not header or len(header) < 2:
        return
    asm_name = header[1]
    ws = wb.create_sheet(title=safe_sheet_name("QUAST_metrics", used))
    ws.append(["Metric", asm_name])
    for r in rows:
        if not r:
            continue
        metric = r[0]
        val = r[1] if len(r) > 1 else ""
        ws.append([metric, val])
    autosize(ws, max_width=80)

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--workdir", required=True, help="workdir produced by pipeline (e.g., table1_GE11174_work)")
    ap.add_argument("--out", required=True, help="output .xlsx")
    ap.add_argument("--sample", default="", help="sample name for summary")
    ap.add_argument("--max-rows", type=int, default=200000, help="max rows per large sheet")
    ap.add_argument("--no-md5", action="store_true", help="skip md5 calculation in File_Inventory")
    args = ap.parse_args()

    workdir = Path(args.workdir).resolve()
    out = Path(args.out).resolve()

    if not workdir.exists():
        sys.stderr.write("ERROR: workdir not found: %s\n" % workdir)
        sys.exit(2)

    wb = Workbook()
    # remove default sheet
    wb.remove(wb.active)
    used = set()

    # Summary info
    info = [
        ("sample", args.sample or ""),
        ("workdir", str(workdir)),
        ("generated_at", time.strftime("%Y-%m-%d %H:%M:%S")),
        ("python", sys.version.replace("\n", " ")),
        ("openpyxl", __import__("openpyxl").__version__),
    ]
    add_summary_sheet(wb, used, info)

    # Table1 TSV (try common names)
    table1_candidates = list(workdir.glob("Table1_*.tsv")) + list(workdir.glob("*.tsv"))
    # Prefer Table1_*.tsv in workdir root
    table1_path = None
    for p in table1_candidates:
        # Accept Table1_*.tsv or table1_*.tsv (make_table1_GE11174.sh writes lowercase)
        if p.name.lower().startswith("table1_") and p.suffix == ".tsv":
            table1_path = p
            break
    if table1_path is None:
        # maybe created in cwd, not inside workdir; try alongside workdir
        parent = workdir.parent
        for p in parent.glob("Table1_*.tsv"):
            if args.sample and args.sample in p.name:
                table1_path = p
                break
        if table1_path is None and list(parent.glob("Table1_*.tsv")):
            table1_path = sorted(parent.glob("Table1_*.tsv"))[0]

    if table1_path is not None and table1_path.exists():
        add_tsv_sheet(wb, used, "Table1", table1_path, max_rows=args.max_rows)

    # QUAST
    quast_report = workdir / "quast" / "report.tsv"
    if quast_report.exists():
        add_tsv_sheet(wb, used, "QUAST_report", quast_report, max_rows=args.max_rows)
        add_quast_metrics_sheet(wb, used, quast_report)

    # Mosdepth summary
    for p in sorted((workdir / "map").glob("*.mosdepth.summary.txt")):
        # mosdepth summary is TSV-like
        name = "Mosdepth_" + p.stem.replace(".mosdepth.summary", "")
        add_tsv_sheet(wb, used, name[:31], p, max_rows=args.max_rows)

    # CheckM
    checkm_sum = workdir / "checkm" / "checkm_summary.tsv"
    if checkm_sum.exists():
        add_tsv_sheet(wb, used, "CheckM", checkm_sum, max_rows=args.max_rows)

    # GUNC outputs (all TSV under gunc/out)
    gunc_out = workdir / "gunc" / "out"
    if gunc_out.exists():
        for p in sorted(gunc_out.rglob("*.tsv")):
            rel = str(p.relative_to(gunc_out))
            sheet = "GUNC_" + rel.replace("/", "_").replace("\\", "_").replace(".tsv", "")
            add_tsv_sheet(wb, used, sheet[:31], p, max_rows=args.max_rows)

    # Log preview
    latest_log = find_latest_log(workdir)
    add_log_preview(wb, used, latest_log)

    # File inventory
    add_file_inventory(
        wb, used, workdir,
        do_md5=(not args.no_md5),
        md5_max_bytes=200*1024*1024,
        max_rows=args.max_rows
    )

    # Save
    out.parent.mkdir(parents=True, exist_ok=True)
    wb.save(str(out))
    print("OK: wrote %s" % out)

if __name__ == "__main__":
    main()

File: make_table1_with_excel.sh

#!/usr/bin/env bash
set -euo pipefail

# make_table1_with_excel.sh
#
# Wrapper to run:
#   1) make_table1_* (stats extraction)
#   2) export_table1_stats_to_excel_py36_compat.py
#
# Example:
#   ENV_NAME=gunc_env AUTO_INSTALL=1 THREADS=32 ~/Scripts/make_table1_with_excel.sh

SAMPLE="${SAMPLE:-GE11174}"
THREADS="${THREADS:-8}"
WORKDIR="${WORKDIR:-table1_${SAMPLE}_work}"
OUT_XLSX="${OUT_XLSX:-Comprehensive_${SAMPLE}.xlsx}"

ENV_NAME="${ENV_NAME:-}"
AUTO_INSTALL="${AUTO_INSTALL:-0}"

log() {
  echo "[$(date '+%F %T')] $*" >&2
}

# Activate conda env if requested
if [[ -n "${ENV_NAME}" ]]; then
  # shellcheck disable=SC1090
  source "$(conda info --base)/etc/profile.d/conda.sh"
  conda activate "${ENV_NAME}"
fi

mkdir -p "${WORKDIR}"

# ------------------------------------------------------------------------------
# Locate scripts
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
MAKE_TABLE1="${MAKE_TABLE1_SCRIPT:-${SCRIPT_DIR}/make_table1_${SAMPLE}.sh}"
EXPORT_PY="${EXPORT_PY_SCRIPT:-${SCRIPT_DIR}/export_table1_stats_to_excel_py36_compat.py}"

# Fallback for naming mismatch (e.g., make_table1_GE11174.sh)
if [[ ! -f "${MAKE_TABLE1}" ]]; then
  MAKE_TABLE1="${SCRIPT_DIR}/make_table1_GE11174.sh"
fi

if [[ ! -f "${MAKE_TABLE1}" ]]; then
  echo "ERROR: make_table1 script not found" >&2
  exit 1
fi

if [[ ! -f "${EXPORT_PY}" ]]; then
  log "WARNING: export_table1_stats_to_excel_py36_compat.py not found next to this script."
  log "         You can set EXPORT_PY_SCRIPT=/path/to/export_table1_stats_to_excel_py36_compat.py"
fi

# ------------------------------------------------------------------------------
# Step 1
log "STEP 1: generating workdir stats..."
ENV_NAME="${ENV_NAME}" AUTO_INSTALL="${AUTO_INSTALL}" THREADS="${THREADS}" SAMPLE="${SAMPLE}" WORKDIR="${WORKDIR}" \
  bash "${MAKE_TABLE1}"

# ------------------------------------------------------------------------------
# Step 2
if [[ -f "${EXPORT_PY}" ]]; then
  log "STEP 2: exporting to Excel..."
  python "${EXPORT_PY}" \
    --workdir "${WORKDIR}" \
    --out "${OUT_XLSX}" \
    --max-rows 200000 \
    --sample "${SAMPLE}"
  log "Wrote: ${OUT_XLSX}"
else
  log "Skipped Excel export (missing export script). Workdir still produced: ${WORKDIR}"
fi

File: merge_amr_sources_by_gene.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
merge_amr_sources_by_gene.py

Merge AMR calls from multiple sources (e.g., ABRicate outputs from MEGARes/ResFinder
and RGI/CARD) by gene name, producing a combined table suitable for reporting/export.

This script is intentionally lightweight and focuses on:
- reading tabular ABRicate outputs
- normalizing gene names
- merging into a per-gene summary

Expected inputs/paths are typically set in your working directory structure.
"""

import os
import sys
import csv
from collections import defaultdict
from typing import Dict, List, Tuple

def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

def read_abricate_tab(path: str) -> List[Dict[str, str]]:
    rows: List[Dict[str, str]] = []
    with open(path, "r", newline="") as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            # ABRicate default is tab-delimited with columns:
            # FILE, SEQUENCE, START, END, STRAND, GENE, COVERAGE, COVERAGE_MAP, GAPS,
            # %COVERAGE, %IDENTITY, DATABASE, ACCESSION, PRODUCT, RESISTANCE
            parts = line.rstrip("\n").split("\t")
            if len(parts) < 12:
                continue
            gene = parts[5].strip()
            rows.append({
                "gene": gene,
                "identity": parts[10].strip(),
                "coverage": parts[9].strip(),
                "db": parts[11].strip(),
                "product": parts[13].strip() if len(parts) > 13 else "",
                "raw": line.rstrip("\n")
            })
    return rows

def normalize_gene(gene: str) -> str:
    g = gene.strip()
    # Add any project-specific normalization rules here
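    # Hypothetical examples (not applied by default):
    #   g = g.upper()                 # merge case variants across databases
    #   g = g.split("(")[0].strip()   # drop parenthesised annotations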
    return g

def merge_sources(sources: List[Tuple[str, str]]) -> Dict[str, Dict[str, List[Dict[str, str]]]]:
    merged: Dict[str, Dict[str, List[Dict[str, str]]]] = defaultdict(lambda: defaultdict(list))
    for src_name, path in sources:
        if not os.path.exists(path):
            eprint(f"WARNING: missing source file: {path}")
            continue
        rows = read_abricate_tab(path)
        for r in rows:
            g = normalize_gene(r["gene"])
            merged[g][src_name].append(r)
    return merged

def write_merged_tsv(out_path: str, merged: Dict[str, Dict[str, List[Dict[str, str]]]]):
    # Flatten into a simple TSV
    with open(out_path, "w", newline="") as fh:
        w = csv.writer(fh, delimiter="\t")
        w.writerow(["gene", "sources", "best_identity", "best_coverage", "notes"])
        for gene, src_map in sorted(merged.items()):
            srcs = sorted(src_map.keys())
            best_id = ""
            best_cov = ""
            notes = []
            # pick best identity/coverage across all hits
            for s in srcs:
                for r in src_map[s]:
                    if not best_id or float(r["identity"]) > float(best_id):
                        best_id = r["identity"]
                    if not best_cov or float(r["coverage"]) > float(best_cov):
                        best_cov = r["coverage"]
                    if r.get("product"):
                        notes.append(f"{s}:{r['product']}")
            w.writerow([gene, ",".join(srcs), best_id, best_cov, "; ".join(notes)])

def main():
    # Default expected layout (customize as needed)
    workdir = os.environ.get("WORKDIR", "resistome_virulence_GE11174")
    sample = os.environ.get("SAMPLE", "GE11174")

    rawdir = os.path.join(workdir, "raw")
    sources = [
        ("MEGARes", os.path.join(rawdir, f"{sample}.megares.tab")),
        ("CARD", os.path.join(rawdir, f"{sample}.card.tab")),
        ("ResFinder", os.path.join(rawdir, f"{sample}.resfinder.tab")),
        ("VFDB", os.path.join(rawdir, f"{sample}.vfdb.tab")),
    ]

    merged = merge_sources(sources)
    out_path = os.path.join(workdir, f"merged_by_gene_{sample}.tsv")
    write_merged_tsv(out_path, merged)
    eprint(f"Wrote merged table: {out_path}")

if __name__ == "__main__":
    main()

File: export_resistome_virulence_to_excel_py36.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
export_resistome_virulence_to_excel_py36.py

Export resistome + virulence profiling outputs to an Excel workbook, compatible with
older Python (3.6) style environments.

Typical usage:
  python export_resistome_virulence_to_excel_py36.py \
    --workdir resistome_virulence_GE11174 \
    --sample GE11174 \
    --out Resistome_Virulence_GE11174.xlsx

Requires:
  - openpyxl
"""

import os
import sys
import csv
import argparse
from typing import List, Dict

try:
    from openpyxl import Workbook
except ImportError:
    sys.stderr.write("ERROR: openpyxl is required\n")
    sys.exit(1)

def eprint(*args, **kwargs):
    print(*args, file=sys.stderr, **kwargs)

def read_tab_file(path: str) -> List[List[str]]:
    rows: List[List[str]] = []
    with open(path, "r", newline="") as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            rows.append(line.rstrip("\n").split("\t"))
    return rows

def autosize(ws):
    # basic autosize columns
    for col_cells in ws.columns:
        max_len = 0
        col_letter = col_cells[0].column_letter
        for c in col_cells:
            if c.value is None:
                continue
            max_len = max(max_len, len(str(c.value)))
        ws.column_dimensions[col_letter].width = min(max_len + 2, 60)

def add_sheet_from_tab(wb: Workbook, title: str, path: str):
    ws = wb.create_sheet(title=title)
    if not os.path.exists(path):
        ws.append([f"Missing file: {path}"])
        return
    rows = read_tab_file(path)
    if not rows:
        ws.append(["No rows"])
        return
    for r in rows:
        ws.append(r)
    autosize(ws)

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--workdir", required=True)
    ap.add_argument("--sample", required=True)
    ap.add_argument("--out", required=True)
    args = ap.parse_args()

    workdir = args.workdir
    sample = args.sample
    out_xlsx = args.out

    rawdir = os.path.join(workdir, "raw")

    files = {
        "MEGARes": os.path.join(rawdir, f"{sample}.megares.tab"),
        "CARD": os.path.join(rawdir, f"{sample}.card.tab"),
        "ResFinder": os.path.join(rawdir, f"{sample}.resfinder.tab"),
        "VFDB": os.path.join(rawdir, f"{sample}.vfdb.tab"),
        "Merged_by_gene": os.path.join(workdir, f"merged_by_gene_{sample}.tsv"),
    }

    wb = Workbook()
    # Remove default sheet
    default = wb.active
    wb.remove(default)

    for title, path in files.items():
        eprint(f"Adding sheet: {title} <- {path}")
        add_sheet_from_tab(wb, title, path)

    wb.save(out_xlsx)
    eprint(f"Wrote Excel: {out_xlsx}")

if __name__ == "__main__":
    main()

File: plot_tree_v4.R

#!/usr/bin/env Rscript

# plot_tree_v4.R
#
# Plot a RAxML-NG support tree with custom labels.
#
# Args:
#   1) support tree file (e.g., core.raxml.support)
#   2) labels.tsv (columns: accession<TAB>display)
#   3) root N (numeric, e.g., 6)
#   4) output PDF
#   5) output PNG

suppressPackageStartupMessages({
  library(ape)
  library(ggplot2)
  library(ggtree)
  library(dplyr)
  library(readr)
  library(aplot)
})

args <- commandArgs(trailingOnly=TRUE)
if (length(args) < 5) {
  cat("Usage: plot_tree_v4.R 
<support_tree> <labels.tsv> <root_n> <out.pdf> <out.png>\n")
  quit(status=1)
}

support_tree <- args[1]
labels_tsv <- args[2]
root_n <- as.numeric(args[3])
out_pdf <- args[4]
out_png <- args[5]

# Read tree
tr <- read.tree(support_tree)

# Read labels
lab <- read_tsv(labels_tsv, col_types=cols(.default="c"))
colnames(lab) <- c("accession","display")

# Map labels
# Current tip labels may include accession-like tokens.
# We'll try exact match first; otherwise keep original.
tip_map <- setNames(lab$display, lab$accession)
new_labels <- sapply(tr$tip.label, function(x) {
  if (x %in% names(tip_map)) {
    tip_map[[x]]
  } else {
    x
  }
})
tr$tip.label <- new_labels

# Root by nth tip if requested
if (!is.na(root_n) && root_n > 0 && root_n <= length(tr$tip.label)) {
  tr <- root(tr, outgroup=tr$tip.label[root_n], resolve.root=TRUE)
}

# Plot
p <- ggtree(tr) +
  geom_tiplab(size=3) +
  theme_tree2()

# Save
ggsave(out_pdf, plot=p, width=8, height=8)
ggsave(out_png, plot=p, width=8, height=8, dpi=300)

cat(sprintf("Wrote: %s\nWrote: %s\n", out_pdf, out_png))

File: run_resistome_virulome_dedup.sh

#!/usr/bin/env bash
set -Eeuo pipefail

# -------- user inputs --------
ENV_NAME="${ENV_NAME:-bengal3_ac3}"
ASM="${ASM:-GE11174.fasta}"
SAMPLE="${SAMPLE:-GE11174}"
OUTDIR="${OUTDIR:-resistome_virulence_${SAMPLE}}"
THREADS="${THREADS:-16}"

# thresholds (set to 0/0 if you truly want ABRicate defaults)
MINID="${MINID:-90}"
MINCOV="${MINCOV:-60}"
# ----------------------------

log(){ echo "[$(date +'%F %T')] $*" >&2; }
need_cmd(){ command -v "$1" >/dev/null 2>&1; }

activate_env() {
  # shellcheck disable=SC1091
  source "$(conda info --base)/etc/profile.d/conda.sh"
  conda activate "${ENV_NAME}"
}

main(){
  activate_env

  mkdir -p "${OUTDIR}"/{raw,amr,virulence,card,tmp}

  log "Env:    ${ENV_NAME}"
  log "ASM:    ${ASM}"
  log "Sample: ${SAMPLE}"
  log "Outdir: ${OUTDIR}"
  log "ABRicate thresholds: MINID=${MINID} MINCOV=${MINCOV}"

  log "ABRicate DB list:"
  abricate --list | egrep -i "vfdb|resfinder|megares|card" || true

  # Make sure indices exist
  log "Running abricate --setupdb (safe even if already done)..."
  abricate --setupdb

  # ---- ABRicate AMR DBs ----
  log "Running ABRicate: ResFinder"
  abricate --db resfinder --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.resfinder.tab"

  log "Running ABRicate: MEGARes"
  abricate --db megares   --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.megares.tab"

  # ---- Virulence (VFDB) ----
  log "Running ABRicate: VFDB"
  abricate --db vfdb      --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.vfdb.tab"

  # ---- CARD: prefer RGI if available, else ABRicate card ----
  CARD_MODE="ABRicate"
  if need_cmd rgi; then
    log "RGI found. Trying RGI (CARD) ..."
    set +e
    rgi main --input_sequence "${ASM}" --output_file "${OUTDIR}/card/${SAMPLE}.rgi" --input_type contig --num_threads "${THREADS}"
    rc=$?
    set -e
    if [[ $rc -eq 0 ]]; then
      CARD_MODE="RGI"
    else
      log "RGI failed (likely CARD data not installed). Falling back to ABRicate card."
    fi
  fi

  if [[ "${CARD_MODE}" == "ABRicate" ]]; then
    log "Running ABRicate: CARD"
    abricate --db card --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.card.tab"
  fi

  # ---- Build deduplicated tables ----
  log "Creating deduplicated AMR/VFDB tables..."

  export OUTDIR SAMPLE CARD_MODE
  python - <<'PY'
import os, re
from pathlib import Path
import pandas as pd
from io import StringIO

outdir = Path(os.environ["OUTDIR"])
sample = os.environ["SAMPLE"]
card_mode = os.environ["CARD_MODE"]

def read_abricate_tab(path: Path, source: str) -> pd.DataFrame:
    if not path.exists() or path.stat().st_size == 0:
        return pd.DataFrame()
    lines=[]
    with path.open("r", errors="replace") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            lines.append(line)
    if not lines:
        return pd.DataFrame()
    df = pd.read_csv(StringIO("".join(lines)), sep="\t", dtype=str)
    df.insert(0, "Source", source)
    return df

def to_num(s):
    try:
        return float(str(s).replace("%",""))
    except (TypeError, ValueError):
        return None

def normalize_abricate(df: pd.DataFrame, dbname: str) -> pd.DataFrame:
    if df.empty:
        return pd.DataFrame(columns=[
            "Source","Database","Gene","Product","Accession","Contig","Start","End","Strand","Pct_Identity","Pct_Coverage"
        ])
    # Column names vary slightly; handle common ones
    gene = "GENE" if "GENE" in df.columns else None
    prod = "PRODUCT" if "PRODUCT" in df.columns else None
    acc  = "ACCESSION" if "ACCESSION" in df.columns else None
    contig = "SEQUENCE" if "SEQUENCE" in df.columns else ("CONTIG" if "CONTIG" in df.columns else None)
    start = "START" if "START" in df.columns else None
    end   = "END" if "END" in df.columns else None
    strand= "STRAND" if "STRAND" in df.columns else None

    pid = "%IDENTITY" if "%IDENTITY" in df.columns else ("% Identity" if "% Identity" in df.columns else None)
    pcv = "%COVERAGE" if "%COVERAGE" in df.columns else ("% Coverage" if "% Coverage" in df.columns else None)

    out = pd.DataFrame()
    out["Source"] = df["Source"]
    out["Database"] = dbname
    out["Gene"] = df[gene] if gene else ""
    out["Product"] = df[prod] if prod else ""
    out["Accession"] = df[acc] if acc else ""
    out["Contig"] = df[contig] if contig else ""
    out["Start"] = df[start] if start else ""
    out["End"] = df[end] if end else ""
    out["Strand"] = df[strand] if strand else ""
    out["Pct_Identity"] = df[pid] if pid else ""
    out["Pct_Coverage"] = df[pcv] if pcv else ""
    return out

def dedup_best(df: pd.DataFrame, key_cols):
    """Keep best hit per key by highest identity, then coverage, then longest span."""
    if df.empty:
        return df
    # numeric helpers
    df = df.copy()
    df["_pid"] = df["Pct_Identity"].map(to_num)
    df["_pcv"] = df["Pct_Coverage"].map(to_num)

    def span(row):
        try:
            return abs(int(row["End"]) - int(row["Start"])) + 1
        except (TypeError, ValueError):
            return 0
    df["_span"] = df.apply(span, axis=1)

    # sort best-first
    df = df.sort_values(by=["_pid","_pcv","_span"], ascending=[False,False,False], na_position="last")
    df = df.drop_duplicates(subset=key_cols, keep="first")
    df = df.drop(columns=["_pid","_pcv","_span"])
    return df

# ---------- AMR inputs ----------
amr_frames = []

# ResFinder (often 0 hits; still okay)
resfinder = outdir / "raw" / f"{sample}.resfinder.tab"
df = read_abricate_tab(resfinder, "ABRicate")
amr_frames.append(normalize_abricate(df, "ResFinder"))

# MEGARes
megares = outdir / "raw" / f"{sample}.megares.tab"
df = read_abricate_tab(megares, "ABRicate")
amr_frames.append(normalize_abricate(df, "MEGARes"))

# CARD: RGI or ABRicate
if card_mode == "RGI":
    # Try common RGI tab outputs
    prefix = outdir / "card" / f"{sample}.rgi"
    rgi_tab = None
    for ext in [".txt",".tab",".tsv"]:
        p = Path(str(prefix) + ext)
        if p.exists() and p.stat().st_size > 0:
            rgi_tab = p
            break
    if rgi_tab is not None:
        rgi = pd.read_csv(rgi_tab, sep="\t", dtype=str)
        # Build on RGI's index so the scalar columns broadcast to every row
        # (assigning into a completely empty DataFrame would yield zero rows).
        out = pd.DataFrame(index=rgi.index)
        out["Source"] = "RGI"
        out["Database"] = "CARD"
        # Column names vary across RGI versions; try common variants, else leave blank.
        out["Gene"] = rgi["ARO_name"] if "ARO_name" in rgi.columns else (rgi["Best_Hit_ARO"] if "Best_Hit_ARO" in rgi.columns else "")
        out["Product"] = rgi["ARO_name"] if "ARO_name" in rgi.columns else ""
        out["Accession"] = rgi["ARO_accession"] if "ARO_accession" in rgi.columns else (rgi["ARO"] if "ARO" in rgi.columns else "")
        out["Contig"] = rgi["Sequence"] if "Sequence" in rgi.columns else (rgi["Contig"] if "Contig" in rgi.columns else "")
        out["Start"] = rgi["Start"] if "Start" in rgi.columns else ""
        out["End"] = rgi["Stop"] if "Stop" in rgi.columns else (rgi["End"] if "End" in rgi.columns else "")
        out["Strand"] = rgi["Orientation"] if "Orientation" in rgi.columns else ""
        out["Pct_Identity"] = rgi["% Identity"] if "% Identity" in rgi.columns else (rgi["Best_Identities"] if "Best_Identities" in rgi.columns else "")
        out["Pct_Coverage"] = rgi["% Coverage"] if "% Coverage" in rgi.columns else ""
        amr_frames.append(out)
else:
    card = outdir / "raw" / f"{sample}.card.tab"
    df = read_abricate_tab(card, "ABRicate")
    amr_frames.append(normalize_abricate(df, "CARD"))

amr_all = pd.concat([x for x in amr_frames if not x.empty], ignore_index=True) if any(not x.empty for x in amr_frames) else pd.DataFrame(
    columns=["Source","Database","Gene","Product","Accession","Contig","Start","End","Strand","Pct_Identity","Pct_Coverage"]
)

# Deduplicate within each (Database,Gene) – this is usually what you want for manuscript tables
amr_dedup = dedup_best(amr_all, key_cols=["Database","Gene"])

# Sort nicely
if not amr_dedup.empty:
    amr_dedup = amr_dedup.sort_values(["Database","Gene"]).reset_index(drop=True)

amr_out = outdir / "Table_AMR_genes_dedup.tsv"
amr_dedup.to_csv(amr_out, sep="\t", index=False)

# ---------- Virulence (VFDB) ----------
vfdb = outdir / "raw" / f"{sample}.vfdb.tab"
vf = read_abricate_tab(vfdb, "ABRicate")
vf_norm = normalize_abricate(vf, "VFDB")

# Dedup within (Gene) for VFDB (or use Database,Gene; Database constant)
vf_dedup = dedup_best(vf_norm, key_cols=["Gene"]) if not vf_norm.empty else vf_norm
if not vf_dedup.empty:
    vf_dedup = vf_dedup.sort_values(["Gene"]).reset_index(drop=True)

vf_out = outdir / "Table_Virulence_VFDB_dedup.tsv"
vf_dedup.to_csv(vf_out, sep="\t", index=False)

print("OK wrote:")
print(" ", amr_out)
print(" ", vf_out)
PY

  log "Done."
  log "Outputs:"
  log "  ${OUTDIR}/Table_AMR_genes_dedup.tsv"
  log "  ${OUTDIR}/Table_Virulence_VFDB_dedup.tsv"
  log "Raw:"
  log "  ${OUTDIR}/raw/${SAMPLE}.*.tab"
}

main

File: run_abricate_resistome_virulome_one_per_gene.sh

#!/usr/bin/env bash
set -Eeuo pipefail

# ------------------- USER SETTINGS -------------------
ENV_NAME="${ENV_NAME:-bengal3_ac3}"

ASM="${ASM:-GE11174.fasta}"          # input assembly fasta
SAMPLE="${SAMPLE:-GE11174}"

OUTDIR="${OUTDIR:-resistome_virulence_${SAMPLE}}"
THREADS="${THREADS:-16}"

# ABRicate thresholds
# If you want your earlier "35 genes" behavior, use MINID=70 MINCOV=50.
# If you want stricter: e.g. MINID=80 MINCOV=70.
MINID="${MINID:-70}"
MINCOV="${MINCOV:-50}"
# -----------------------------------------------------

ts(){ date +"%F %T"; }
log(){ echo "[$(ts)] $*" >&2; }

on_err(){
  local ec=$?
  log "ERROR: failed (exit=${ec}) at line ${BASH_LINENO[0]}: ${BASH_COMMAND}"
  exit $ec
}
trap on_err ERR

need_cmd(){ command -v "$1" >/dev/null 2>&1; }

activate_env() {
  # shellcheck disable=SC1091
  source "$(conda info --base)/etc/profile.d/conda.sh"
  conda activate "${ENV_NAME}"
}

main(){
  activate_env

  log "Env: ${ENV_NAME}"
  log "ASM: ${ASM}"
  log "Sample: ${SAMPLE}"
  log "Outdir: ${OUTDIR}"
  log "Threads: ${THREADS}"
  log "ABRicate thresholds: MINID=${MINID} MINCOV=${MINCOV}"

  mkdir -p "${OUTDIR}"/{raw,logs}

  # Save full log
  LOGFILE="${OUTDIR}/logs/run_$(date +'%F_%H%M%S').log"
  exec > >(tee -a "${LOGFILE}") 2>&1

  log "Tool versions:"
  abricate --version || true
  abricate-get_db --help | head -n 5 || true

  log "ABRicate DB list (selected):"
  abricate --list | egrep -i "vfdb|resfinder|megares|card" || true

  log "Indexing ABRicate databases (safe to re-run)..."
  abricate --setupdb

  # ---------------- Run ABRicate ----------------
  log "Running ABRicate: MEGARes"
  abricate --db megares   --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.megares.tab"

  log "Running ABRicate: CARD"
  abricate --db card      --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.card.tab"

  log "Running ABRicate: ResFinder"
  abricate --db resfinder --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.resfinder.tab"

  log "Running ABRicate: VFDB"
  abricate --db vfdb      --minid "${MINID}" --mincov "${MINCOV}" "${ASM}" > "${OUTDIR}/raw/${SAMPLE}.vfdb.tab"

  # --------------- Build tables -----------------
  export OUTDIR SAMPLE
  export MEGARES_TAB="${OUTDIR}/raw/${SAMPLE}.megares.tab"
  export CARD_TAB="${OUTDIR}/raw/${SAMPLE}.card.tab"
  export RESFINDER_TAB="${OUTDIR}/raw/${SAMPLE}.resfinder.tab"
  export VFDB_TAB="${OUTDIR}/raw/${SAMPLE}.vfdb.tab"

  export AMR_OUT="${OUTDIR}/Table_AMR_genes_one_per_gene.tsv"
  export VIR_OUT="${OUTDIR}/Table_Virulence_VFDB_dedup.tsv"
  export STATUS_OUT="${OUTDIR}/Table_DB_hit_counts.tsv"

  log "Generating deduplicated tables..."
  python - <<'PY'
import os
import pandas as pd
from pathlib import Path

megares_tab   = Path(os.environ["MEGARES_TAB"])
card_tab      = Path(os.environ["CARD_TAB"])
resfinder_tab = Path(os.environ["RESFINDER_TAB"])
vfdb_tab      = Path(os.environ["VFDB_TAB"])

amr_out    = Path(os.environ["AMR_OUT"])
vir_out    = Path(os.environ["VIR_OUT"])
status_out = Path(os.environ["STATUS_OUT"])

def read_abricate(path: Path) -> pd.DataFrame:
    """Parse ABRicate .tab where header line starts with '#FILE'."""
    if (not path.exists()) or path.stat().st_size == 0:
        return pd.DataFrame()
    header = None
    rows = []
    with path.open("r", errors="replace") as f:
        for line in f:
            if not line.strip():
                continue
            if line.startswith("#FILE"):
                header = line.lstrip("#").rstrip("\n").split("\t")
                continue
            if line.startswith("#"):
                continue
            rows.append(line.rstrip("\n").split("\t"))
    if header is None:
        return pd.DataFrame()
    if not rows:
        return pd.DataFrame(columns=header)
    return pd.DataFrame(rows, columns=header)

def normalize(df: pd.DataFrame, dbname: str) -> pd.DataFrame:
    cols_out = ["Database","Gene","Product","Accession","Contig","Start","End","Strand","Pct_Identity","Pct_Coverage"]
    if df is None or df.empty:
        return pd.DataFrame(columns=cols_out)
    out = pd.DataFrame({
        "Database": dbname,
        "Gene": df.get("GENE",""),
        "Product": df.get("PRODUCT",""),
        "Accession": df.get("ACCESSION",""),
        "Contig": df.get("SEQUENCE",""),
        "Start": df.get("START",""),
        "End": df.get("END",""),
        "Strand": df.get("STRAND",""),
        "Pct_Identity": pd.to_numeric(df.get("%IDENTITY",""), errors="coerce"),
        "Pct_Coverage": pd.to_numeric(df.get("%COVERAGE",""), errors="coerce"),
    })
    return out[cols_out]

def best_hit_dedup(df: pd.DataFrame, key_cols):
    """Keep best hit by highest identity, then coverage, then alignment length."""
    if df.empty:
        return df
    d = df.copy()
    d["Start_i"] = pd.to_numeric(d["Start"], errors="coerce").fillna(0).astype(int)
    d["End_i"]   = pd.to_numeric(d["End"], errors="coerce").fillna(0).astype(int)
    d["Len"]     = (d["End_i"] - d["Start_i"]).abs() + 1
    d = d.sort_values(["Pct_Identity","Pct_Coverage","Len"], ascending=[False,False,False])
    d = d.drop_duplicates(subset=key_cols, keep="first")
    return d.drop(columns=["Start_i","End_i","Len"])

def count_hits(path: Path) -> int:
    if not path.exists():
        return 0
    n = 0
    with path.open() as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue
            n += 1
    return n

# -------- load + normalize --------
parts = []
for dbname, p in [("MEGARes", megares_tab), ("CARD", card_tab), ("ResFinder", resfinder_tab)]:
    df = read_abricate(p)
    parts.append(normalize(df, dbname))

amr_all = pd.concat([x for x in parts if not x.empty], ignore_index=True) if any(not x.empty for x in parts) else pd.DataFrame(
    columns=["Database","Gene","Product","Accession","Contig","Start","End","Strand","Pct_Identity","Pct_Coverage"]
)

# remove empty genes
amr_all = amr_all[amr_all["Gene"].astype(str).str.len() > 0].copy()

# best per (Database,Gene)
amr_db_gene = best_hit_dedup(amr_all, ["Database","Gene"]) if not amr_all.empty else amr_all

# one row per Gene overall, priority: CARD > ResFinder > MEGARes
priority = {"CARD": 0, "ResFinder": 1, "MEGARes": 2}
if not amr_db_gene.empty:
    amr_db_gene["prio"] = amr_db_gene["Database"].map(priority).fillna(9).astype(int)
    amr_one = amr_db_gene.sort_values(
        ["Gene","prio","Pct_Identity","Pct_Coverage"],
        ascending=[True, True, False, False]
    )
    amr_one = amr_one.drop_duplicates(["Gene"], keep="first").drop(columns=["prio"])
    amr_one = amr_one.sort_values(["Gene"]).reset_index(drop=True)
else:
    amr_one = amr_db_gene

amr_out.parent.mkdir(parents=True, exist_ok=True)
amr_one.to_csv(amr_out, sep="\t", index=False)

# -------- VFDB --------
vf = normalize(read_abricate(vfdb_tab), "VFDB")
vf = vf[vf["Gene"].astype(str).str.len() > 0].copy()
vf_one = best_hit_dedup(vf, ["Gene"]) if not vf.empty else vf
if not vf_one.empty:
    vf_one = vf_one.sort_values(["Gene"]).reset_index(drop=True)

vir_out.parent.mkdir(parents=True, exist_ok=True)
vf_one.to_csv(vir_out, sep="\t", index=False)

# -------- status counts --------
status = pd.DataFrame([
    {"Database":"MEGARes",   "Hit_lines": count_hits(megares_tab),   "File": str(megares_tab)},
    {"Database":"CARD",      "Hit_lines": count_hits(card_tab),      "File": str(card_tab)},
    {"Database":"ResFinder", "Hit_lines": count_hits(resfinder_tab), "File": str(resfinder_tab)},
    {"Database":"VFDB",      "Hit_lines": count_hits(vfdb_tab),      "File": str(vfdb_tab)},
])
status_out.parent.mkdir(parents=True, exist_ok=True)
status.to_csv(status_out, sep="\t", index=False)

print("OK wrote:")
print(" ", amr_out, "rows=", len(amr_one))
print(" ", vir_out, "rows=", len(vf_one))
print(" ", status_out)
PY

  log "Finished."
  log "Main outputs:"
  log "  ${AMR_OUT}"
  log "  ${VIR_OUT}"
  log "  ${STATUS_OUT}"
  log "Raw ABRicate outputs:"
  log "  ${OUTDIR}/raw/${SAMPLE}.megares.tab"
  log "  ${OUTDIR}/raw/${SAMPLE}.card.tab"
  log "  ${OUTDIR}/raw/${SAMPLE}.resfinder.tab"
  log "  ${OUTDIR}/raw/${SAMPLE}.vfdb.tab"
  log "Log:"
  log "  ${LOGFILE}"
}

main


Processing Data_Benjamin_DNAseq_2026_GE11174

core_tree_like_fig3B

  1. Download the KmerFinder database: https://www.genomicepidemiology.org/services/ -> https://cge.food.dtu.dk/services/KmerFinder/ -> https://cge.food.dtu.dk/services/KmerFinder/etc/kmerfinder_db.tar.gz

    # Download 20190108_kmerfinder_stable_dirs.tar.gz from https://zenodo.org/records/13447056
  2. Run nextflow bacass

    #--kmerfinderdb /path/to/kmerfinder/bacteria.tar.gz
    #--kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder_db.tar.gz
    #--kmerfinderdb /mnt/nvme1n1p1/REFs/20190108_kmerfinder_stable_dirs.tar.gz
    nextflow run nf-core/bacass -r 2.5.0 -profile docker \
      --input samplesheet.tsv \
      --outdir bacass_out \
      --assembly_type long \
      --kraken2db /mnt/nvme1n1p1/REFs/k2_standard_08_GB_20251015.tar.gz \
      --kmerfinderdb /mnt/nvme1n1p1/REFs/kmerfinder/bacteria/ \
      -resume
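
    The final command points --kmerfinderdb at an extracted directory rather than a
    tarball. A minimal extraction sketch (the destination directory is an assumption
    chosen to match the path used above):

    mkdir -p /mnt/nvme1n1p1/REFs/kmerfinder
    tar -xzf 20190108_kmerfinder_stable_dirs.tar.gz -C /mnt/nvme1n1p1/REFs/kmerfinder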

  3. KmerFinder summary

    From the KmerFinder summary, the top hit is Gibbsiella quercinecans (strain FRB97; NZ_CP014136.1), with a much higher score and coverage than the second hit (which has low coverage). So it’s fair to write:

    “KmerFinder indicates the isolate is most consistent with Gibbsiella quercinecans.”

    …but for a species call (especially for publication), you should confirm with ANI (or a genome taxonomy tool), because k-mer hits alone aren’t always definitive; a quick confirmation is sketched below, and the full fastANI runs follow in step 9.
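
    A minimal confirmation sketch, assuming the FRB97 reference has already been
    downloaded to the tree workdir used in step 9 (the output filename is
    illustrative):

    fastANI -q GE11174.fasta \
      -r ./work_wgs_tree/fastas/GCF_002291425.1.fna \
      -o fastANI_confirm_quercinecans.txt
    # An ANI of roughly >= 95-96% against the G. quercinecans reference supports the species call.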

  4. Use https://www.bv-brc.org/app/ComprehensiveGenomeAnalysis to annotate the genome with the scaffolded assembly from bacass. ComprehensiveGenomeAnalysis provides a comprehensive overview of the data.

  5. Generate the Table 1 summary of sequence data and genome features under the env gunc_env

    Activate the env that has openpyxl:

    mamba activate gunc_env
    mamba install -n gunc_env -c conda-forge openpyxl -y
    mamba deactivate

    STEP_1

    ENV_NAME=gunc_env AUTO_INSTALL=1 THREADS=32 ~/Scripts/make_table1_GE11174.sh

    STEP_2

    python export_table1_stats_to_excel_py36_compat.py \
      --workdir table1_GE11174_work \
      --out Comprehensive_GE11174.xlsx \
      --max-rows 200000 \
      --sample GE11174

    STEP_1+2

    ENV_NAME=gunc_env AUTO_INSTALL=1 THREADS=32 ~/Scripts/make_table1_with_excel.sh

    # For the items "Total number of reads sequenced" and "Mean read length (bp)"
    # (each FASTQ record spans four lines, hence NR/4):
    pigz -dc GE11174.rawreads.fastq.gz | awk 'END{print NR/4}'
    seqkit stats GE11174.rawreads.fastq.gz

  6. Antimicrobial resistance gene profiling and resistome/virulence profiling with ABRicate and RGI (Resistance Gene Identifier)

    Table 4. Specialty Genes

    Property               Source  Genes
    Antibiotic Resistance  NDARO   1
    Antibiotic Resistance  CARD    15
    Antibiotic Resistance  PATRIC  55
    Drug Target            TTD     38
    Metal Resistance       BacMet  29
    Transporter            TCDB    250
    Virulence Factor       VFDB    33

    https://www.genomicepidemiology.org/services/

    https://genepi.dk/

    conda activate /home/jhuang/miniconda3/envs/bengal3_ac3
    abricate --list
    #DATABASE       SEQUENCES  DBTYPE  DATE
    #vfdb           2597       nucl    2025-Oct-22
    #resfinder      3077       nucl    2025-Oct-22
    #argannot       2223       nucl    2025-Oct-22
    #ecoh           597        nucl    2025-Oct-22
    #megares        6635       nucl    2025-Oct-22
    #card           2631       nucl    2025-Oct-22
    #ecoli_vf       2701       nucl    2025-Oct-22
    #plasmidfinder  460        nucl    2025-Oct-22
    #ncbi           5386       nucl    2025-Oct-22
    abricate-get_db --list
    #Choices: argannot bacmet2 card ecoh ecoli_vf megares ncbi plasmidfinder resfinder vfdb victors (default '').

    CARD

    abricate-get_db --db card

    MEGARes (installs automatically; if it errors, try the MANUAL install below)

    abricate-get_db --db megares

    MANUAL install

    wget -O megares_database_v3.00.fasta \
      "https://www.meglab.org/downloads/megares_v3.00/megares_database_v3.00.fasta"
    #wget -O megares_drugs_database_v3.00.fasta \
    #  "https://www.meglab.org/downloads/megares_v3.00/megares_drugs_database_v3.00.fasta"

    1) Define dbdir (adjust to your env; from your logs it's inside the conda env)

    DBDIR=/home/jhuang/miniconda3/envs/bengal3_ac3/db

    2) Create a custom db folder for MEGARes v3.0

    mkdir -p ${DBDIR}/megares_v3.0

    3) Copy the downloaded MEGARes v3.0 nucleotide FASTA to 'sequences'

    cp megares_database_v3.00.fasta ${DBDIR}/megares_v3.0/sequences

    4) Build ABRicate indices

    abricate --setupdb
    #abricate-get_db --setupdb
    abricate --list | egrep 'card|megares'
    abricate --list | grep -i megares

    chmod +x run_resistome_virulome.sh
    ASM=GE11174.fasta SAMPLE=GE11174 THREADS=32 ./run_resistome_virulome.sh

    chmod +x run_resistome_virulome_dedup.sh
    ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=GE11174.fasta SAMPLE=GE11174 THREADS=32 ./run_resistome_virulome_dedup.sh
    ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=./vrap_HF/spades/scaffolds.fasta SAMPLE=HF THREADS=32 ~/Scripts/run_resistome_virulome_dedup.sh
    ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 ASM=GE11174.fasta SAMPLE=GE11174 MINID=80 MINCOV=60 ./run_resistome_virulome_dedup.sh

    grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.megares.tab
    grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.card.tab
    grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.resfinder.tab
    grep -vc '^#' resistome_virulence_GE11174/raw/GE11174.vfdb.tab

    grep -v '^#' resistome_virulence_GE11174/raw/GE11174.megares.tab | grep -v '^[[:space:]]*$' | head -n 3
    grep -v '^#' resistome_virulence_GE11174/raw/GE11174.card.tab | grep -v '^[[:space:]]*$' | head -n 3
    grep -v '^#' resistome_virulence_GE11174/raw/GE11174.resfinder.tab | grep -v '^[[:space:]]*$' | head -n 3
    grep -v '^#' resistome_virulence_GE11174/raw/GE11174.vfdb.tab | grep -v '^[[:space:]]*$' | head -n 3

    chmod +x make_dedup_tables_from_abricate.sh
    OUTDIR=resistome_virulence_GE11174 SAMPLE=GE11174 ./make_dedup_tables_from_abricate.sh

    chmod +x run_abricate_resistome_virulome_one_per_gene.sh
    ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3 \
    ASM=GE11174.fasta \
    SAMPLE=GE11174 \
    OUTDIR=resistome_virulence_GE11174 \
    MINID=80 MINCOV=60 \
    THREADS=32 \
    ./run_abricate_resistome_virulome_one_per_gene.sh

    #ABRicate thresholds: MINID=70 MINCOV=50
    Database   Hit_lines  File
    MEGARes    35         resistome_virulence_GE11174/raw/GE11174.megares.tab
    CARD       28         resistome_virulence_GE11174/raw/GE11174.card.tab
    ResFinder  2          resistome_virulence_GE11174/raw/GE11174.resfinder.tab
    VFDB       18         resistome_virulence_GE11174/raw/GE11174.vfdb.tab

    #ABRicate thresholds: MINID=80 MINCOV=60
    Database   Hit_lines  File
    MEGARes    3          resistome_virulence_GE11174/raw/GE11174.megares.tab
    CARD       1          resistome_virulence_GE11174/raw/GE11174.card.tab
    ResFinder  0          resistome_virulence_GE11174/raw/GE11174.resfinder.tab
    VFDB       0          resistome_virulence_GE11174/raw/GE11174.vfdb.tab

    python merge_amr_sources_by_gene.py
    python export_resistome_virulence_to_excel_py36.py \
      --workdir resistome_virulence_GE11174 \
      --sample GE11174 \
      --out Resistome_Virulence_GE11174.xlsx

    Methods sentence (AMR + virulence)

    AMR genes were identified by screening the genome assembly with ABRicate against the MEGARes and ResFinder databases, using minimum identity and coverage thresholds of X% and Y%, respectively. CARD-based AMR determinants were additionally predicted using RGI (Resistance Gene Identifier) to leverage curated resistance models. Virulence factors were screened using ABRicate against VFDB under the same thresholds.

    Replace X/Y with your actual values (e.g., 90/60) or state “default parameters” if you truly used defaults.

    Table 2 caption (AMR)

    Table 2. AMR gene profiling of the genome assembly. Hits were detected using ABRicate (MEGARes and ResFinder) and RGI (CARD). The presence of AMR-associated genes does not necessarily imply phenotypic resistance, which may depend on allele type, genomic context/expression, and/or SNP-mediated mechanisms; accordingly, phenotype predictions (e.g., ResFinder) should be interpreted cautiously.

    Table 3 caption (virulence)

    Table 3. Virulence factor profiling of the genome assembly based on ABRicate screening against VFDB, reporting loci with sequence identity and coverage above the specified thresholds.

  7. Generate phylogenetic tree

    export NCBI_EMAIL="j.huang@uke.de"
    ./resolve_best_assemblies_entrez.py targets.tsv resolved_accessions.tsv

    Note: the env bengal3_ac3 doesn't have the following R packages; use r_env for the plot step!

    #mamba install -y -c conda-forge -c bioconda r-aplot bioconductor-ggtree r-ape r-ggplot2 r-dplyr r-readr

    chmod +x build_wgs_tree_fig3B.sh
    export ENV_NAME=/home/jhuang/miniconda3/envs/bengal3_ac3
    export NCBI_EMAIL="j.huang@uke.de"
    ./build_wgs_tree_fig3B.sh

  8. DEBUG (recommended): remove one genome and rerun Roary → RAxML; Example: drop GCF_047901425.1 (change to the other one if you prefer).

    1.1) remove from inputs so Roary cannot include it

    rm -f work_wgs_tree/gffs/GCF_047901425.1.gff
    rm -f work_wgs_tree/fastas/GCF_047901425.1.fna
    rm -rf work_wgs_tree/prokka/GCF_047901425.1
    rm -rf work_wgs_tree/genomes_ncbi/GCF_047901425.1  #optional

    1.2) remove from accession list so it won’t come back

    awk -F'\t' 'NR==1 || $2!="GCF_047901425.1"' work_wgs_tree/meta/accessions.tsv > work_wgs_tree/meta/accessions.tsv.tmp \
      && mv work_wgs_tree/meta/accessions.tsv.tmp work_wgs_tree/meta/accessions.tsv

    2.1) remove from inputs so Roary cannot include it

    rm -f work_wgs_tree/gffs/GCA_032062225.1.gff
    rm -f work_wgs_tree/fastas/GCA_032062225.1.fna
    rm -rf work_wgs_tree/prokka/GCA_032062225.1
    rm -rf work_wgs_tree/genomes_ncbi/GCA_032062225.1  #optional

    2.2) remove from accession list so it won’t come back

    awk -F'\t' 'NR==1 || $2!="GCA_032062225.1"' work_wgs_tree/meta/accessions.tsv > work_wgs_tree/meta/accessions.tsv.tmp \
      && mv work_wgs_tree/meta/accessions.tsv.tmp work_wgs_tree/meta/accessions.tsv

    3) delete old roary runs (so you don’t accidentally reuse old alignment)

    rm -rf work_wgs_tree/roary_*

    4) rerun Roary (fresh output dir)

    mkdir -p work_wgs_tree/logs
    ROARY_OUT="work_wgs_tree/roary_$(date +%s)"
    roary -e --mafft -p 8 -cd 95 -i 95 \
      -f "$ROARY_OUT" \
      work_wgs_tree/gffs/*.gff \
      > work_wgs_tree/logs/roary_rerun.stdout.txt \
      2> work_wgs_tree/logs/roary_rerun.stderr.txt

    5) point meta file to new core alignment (absolute path)

    echo "$(readlink -f "$ROARY_OUT/core_gene_alignment.aln")" > work_wgs_tree/meta/core_alignment_path.txt
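
    A quick optional sanity check: the new alignment should contain one FASTA record per genome left after the removals above.

    ```bash
    # Count sequence records in the core alignment.
    grep -c '^>' "$(cat work_wgs_tree/meta/core_alignment_path.txt)"
    ```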

    6) rerun RAxML-NG

    rm -rf work_wgs_tree/raxmlng
    mkdir work_wgs_tree/raxmlng/
    raxml-ng --all \
      --msa "$(cat work_wgs_tree/meta/core_alignment_path.txt)" \
      --model GTR+G \
      --bs-trees 1000 \
      --threads 8 \
      --prefix work_wgs_tree/raxmlng/core

    7) Run this to regenerate labels.tsv

    bash regenerate_labels.sh

    8) Manually correct the display names in work_wgs_tree/plot/labels.tsv (e.g., with vim)

    #Gibbsiella greigii USA56
    #Gibbsiella papilionis PWX6
    #Gibbsiella quercinecans strain FRB97
    #Brenneria nigrifluens LMG 5956

    9) Rerun only the plot step:

    Rscript work_wgs_tree/plot/plot_tree.R \
      work_wgs_tree/raxmlng/core.raxml.support \
      work_wgs_tree/plot/labels.tsv \
      6 \
      work_wgs_tree/plot/core_tree_like_fig3B.pdf \
      work_wgs_tree/plot/core_tree_like_fig3B.png

  9. fastANI and BUSCO explanations

    find . -name "*.fna"
    #./work_wgs_tree/fastas/GCF_004342245.1.fna   GCF_004342245.1  Gibbsiella quercinecans DSM 25889
    #./work_wgs_tree/fastas/GCF_039539505.1.fna   GCF_039539505.1  Gibbsiella papilionis PWX6
    #./work_wgs_tree/fastas/GCF_005484965.1.fna   GCF_005484965.1  Brenneria nigrifluens LMG5956
    #./work_wgs_tree/fastas/GCA_039540155.1.fna   GCA_039540155.1  Gibbsiella greigii USA56
    #./work_wgs_tree/fastas/GE11174.fna
    #./work_wgs_tree/fastas/GCF_002291425.1.fna   GCF_002291425.1  Gibbsiella quercinecans FRB97

    mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
    fastANI -q GE11174.fasta -r ./work_wgs_tree/fastas/GCF_004342245.1.fna -o fastANI_out_Gibbsiella_quercinecans_DSM_25889.txt
    fastANI -q GE11174.fasta -r ./work_wgs_tree/fastas/GCF_039539505.1.fna -o fastANI_out_Gibbsiella_papilionis_PWX6.txt
    fastANI -q GE11174.fasta -r ./work_wgs_tree/fastas/GCF_005484965.1.fna -o fastANI_out_Brenneria_nigrifluens_LMG5956.txt
    fastANI -q GE11174.fasta -r ./work_wgs_tree/fastas/GCA_039540155.1.fna -o fastANI_out_Gibbsiella_greigii_USA56.txt
    fastANI -q GE11174.fasta -r ./work_wgs_tree/fastas/GCF_002291425.1.fna -o fastANI_out_Gibbsiella_quercinecans_FRB97.txt
    cat fastANI_out_*.txt > fastANI_out.txt

    GE11174.fasta  ./work_wgs_tree/fastas/GCF_005484965.1.fna  79.1194  597   1890
    GE11174.fasta  ./work_wgs_tree/fastas/GCA_039540155.1.fna  95.9589  1547  1890
    GE11174.fasta  ./work_wgs_tree/fastas/GCF_039539505.1.fna  97.2172  1588  1890
    GE11174.fasta  work_wgs_tree/fastas/GCF_004342245.1.fna    98.0889  1599  1890
    GE11174.fasta  ./work_wgs_tree/fastas/GCF_002291425.1.fna  98.1285  1622  1890

    #In bacterial genome comparisons, a commonly used empirical threshold is:

    • ANI ≥ 95–96%: usually considered within the range of the same species
    • Your 97.09% → very likely means An6 and HITLi7 belong to the same species, but possibly not the same strain, since some divergence remains. Whether two isolates are the “same strain” usually also requires:
    • core-gene SNP distance, cgMLST
    • assembly quality/contamination
    • whether coverage is sufficiently high

    #Quick interpretation of the BUSCO results (as an aside); these have already been incorporated into Table 1. A one-liner to pull the summary line is sketched after the list below.

    • Complete 99.2%, Missing 0.0%: the assembly is highly complete (excellent for a bacterial genome)
    • Duplicated 0.0%: no inflated duplicate copies, so the contamination/mixed-sample risk is low
    • 80 scaffolds, N50 ~169 kb: somewhat fragmented, but overall quality is more than sufficient for ANI/species identification
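
    A minimal sketch for extracting that summary line from a BUSCO run (the directory name busco_GE11174/ is an assumption; point it at your actual BUSCO output):

    ```bash
    # Print the condensed BUSCO summary line, e.g. "C:99.2%[S:99.2%,D:0.0%],...".
    grep -h 'C:' busco_GE11174/short_summary.*.txt
    ```
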
  10. fastANI explanation

From your tree and the fastANI table, GE11174 is clearly inside the Gibbsiella quercinecans clade, and far from the outgroup (Brenneria nigrifluens). The ANI values quantify that same pattern.

1) Outgroup check (sanity)

  • GE11174 vs Brenneria nigrifluens (GCF_005484965.1): ANI 79.12% (597/1890 fragments)

    • 79% ANI is way below any species boundary → not the same genus/species.
    • On the tree, Brenneria sits on a long branch as the outgroup, consistent with this deep divergence.
    • The relatively low number of matched fragments (597/1890) also fits “distant genomes” (fewer orthologous regions pass the ANI mapping filters).

2) Species-level placement of GE11174

A common rule of thumb you quoted is correct: ANI ≥ 95–96% ⇒ same species.
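
To apply that rule across all comparisons at once, here is a minimal awk sketch over the merged fastANI_out.txt (columns: query, reference, ANI, matched fragments, total fragments):

```bash
# Flag each fastANI comparison against the ~95-96% species cutoff.
awk '{
  v = ($3 >= 96) ? "same species" : (($3 >= 95) ? "borderline" : "different species")
  printf "%s vs %s: ANI=%.2f%% (%d/%d fragments) -> %s\n", $1, $2, $3, $4, $5, v
}' fastANI_out.txt
```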

Compare GE11174 to the Gibbsiella references:

  • vs GCA_039540155.1 (Gibbsiella greigii USA56): 95.96% (1547/1890)

    • Right at the boundary. This suggests “close but could be different species” or “taxonomy/labels may not reflect true species boundaries” depending on how those genomes are annotated.
    • On the tree, G. greigii is outside the quercinecans group but not hugely far, which matches “borderline ANI”.
  • vs GCF_039539505.1 (Gibbsiella papilionis PWX6): 97.22% (1588/1890)

  • vs GCF_004342245.1 (G. quercinecans DSM 25889): 98.09% (1599/1890)

  • vs GCF_002291425.1 (G. quercinecans FRB97): 98.13% (1622/1890)

These are all comfortably above 96%, especially the two quercinecans genomes (~98.1%). That strongly supports:

GE11174 belongs to the same species as Gibbsiella quercinecans (and is closer to quercinecans references than to greigii).

This is exactly what your tree shows: GE11174 clusters in the quercinecans group, not with the outgroup.

3) Closest reference and “same strain?” question

GE11174’s closest by ANI in your list is:

  • FRB97 (GCF_002291425.1): 98.1285%
  • DSM 25889 (GCF_004342245.1): 98.0889%
  • Next: PWX6 97.2172%

These differences are small, but 98.1% ANI is not “same strain” evidence by itself. Within a species, different strains commonly sit anywhere from ~96–99.9% ANI depending on diversity. To claim “same strain / very recent transmission,” people usually look for much tighter genome-wide similarity:

  • core-genome SNP distance (often single digits to tens, depending on organism and context)
  • cgMLST allele differences
  • recombination filtering (if relevant)
  • assembly QC/contamination checks
  • and confirming that alignment/ANI coverage is high and not biased by missing regions

Your fragment matches (e.g., 1622/1890 for FRB97) are reasonably high, supporting that the ANI is meaningful, but it still doesn’t equate to “same strain.”
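
If a strain-level statement is ever needed, core-genome SNP distances are the usual next step; a minimal sketch, assuming snp-dists is installed in the active environment:

```bash
# Pairwise SNP distance matrix from the Roary core-gene alignment.
snp-dists "$(cat work_wgs_tree/meta/core_alignment_path.txt)" \
  > work_wgs_tree/meta/core_snp_distances.tsv
```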

4) How to phrase the combined interpretation (tree + ANI)

A clear summary you can use:

  • The phylogenetic tree places GE11174 within the Gibbsiella quercinecans lineage, with Brenneria nigrifluens as a distant outgroup.
  • fastANI supports this:

    • ~98.1% ANI to G. quercinecans FRB97 and DSM 25889 → strong same-species support.
    • 97.2% to G. papilionis → still same-species range, but more distant than the quercinecans references.
    • 95.96% to G. greigii → borderline; consistent with being a close neighboring lineage but not the best species match for GE11174.
    • 79.1% to Brenneria → confirms it is an appropriate outgroup and far outside the species/genus boundary.
  • Therefore, GE11174 is very likely Gibbsiella quercinecans (species-level), and appears most similar to FRB97/DSM 25889, but additional high-resolution analyses are required to assess “same strain.”

For a tighter “strain-level” conclusion, run fastANI against a broader reference set and/or compute core SNP distances from the Roary alignment (see the snp-dists sketch above).

#TODO_NEXT_MONDAY: * phylogenetic tree + fastANI + nf-core/pairgenomealign (compare to the closest isolate; https://nf-co.re/pairgenomealign/2.2.1/)

            * summarize all results in a mail and send them back; mention that we can submit the genome to NCBI to obtain a high-quality annotation. What strain name would you like to assign to this isolate?

            * If they agree, I can submit the two new isolates to the NCBI database!
  1. Submit both sequences in a batch to the NCBI server!

  2. Find the closest isolate in GenBank (robust approach) for STEP_7


    # download all available genomes for the genus Gibbsiella (includes assemblies + metadata)
    # --assembly-level flag: accepts 'chromosome', 'complete', 'contig', 'scaffold'
    datasets download genome taxon Gibbsiella --include genome,gff3,gbff --assembly-level complete,chromosome,scaffold --filename gibbsiella.zip
    unzip -q gibbsiella.zip -d gibbsiella_ncbi
    
    mamba activate /home/jhuang/miniconda3/envs/bengal3_ac3
    
    # make a Mash sketch of your isolate
    mash sketch -o isolate bacass_out/Unicycler/GE11174.scaffolds.fa
    
    # sketch all reference genomes (example path; adjust)
    find gibbsiella_ncbi -name "*.fna" -o -name "*.fasta" > refs.txt
    mash sketch -o refs -l refs.txt
    
    # get closest genomes
    mash dist isolate.msh refs.msh | sort -gk3 | head -n 20 > top20_mash.txt
    
    ## What your Mash results mean
    
    * The **best hits** to your assembly (`GE11174.scaffolds.fa`) are:
    
      * **GCA/GCF_002291425.1** (shows up twice: GenBank **GCA** and RefSeq **GCF** copies of the *same assembly*)
      * **GCA/GCF_004342245.1** (same duplication pattern)
      * **GCA/GCF_047901425.1** (FRB97; also duplicated)
    * Mash distances around **0.018–0.020** are **very close** (typically same species; often same genus and usually within-species).
    * The `0` in your output is just Mash’s p-value being printed as 0 due to underflow (i.e., extremely significant).
    
    So yes: your isolate looks **very close to those Gibbsiella genomes**, and FRB97 being in that set is consistent with your earlier KmerFinder result.
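
    As a rough cross-check, Mash distance approximates 1 - ANI, so D ≈ 0.018–0.020 corresponds to ANI ≈ 98.0–98.2%, in line with the fastANI values above; a minimal sketch:

    ```bash
    # Convert Mash distances to approximate ANI for the top hits
    # (mash dist columns: reference, query, distance, p-value, shared-hashes).
    awk '{printf "%s  D=%.4f  ~ANI=%.2f%%\n", $2, $3, (1-$3)*100}' top20_mash.txt
    ```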

    5. Remove duplicates (GCA vs GCF)

    Right now you’re seeing the same genome twice (GenBank + RefSeq). For downstream work, keep one.
    
    Example: keep only **GCF** if available, else GCA:
    
    ```bash
    # List unique assembly accessions among the top hits
    # (each genome can appear twice: GenBank GCA_* and RefSeq GCF_* copies)
    cat top20_mash.txt \
      | awk '{print $2}' \
      | grep -oE 'GC[AF]_[0-9]+\.[0-9]+' \
      | sort -u
    ```
    
    But easiest: just manually keep one of each pair (an automated variant is sketched after this list):
    
    * keep `GCF_002291425.1` (drop `GCA_002291425.1`)
    * keep `GCF_004342245.1`
    * keep `GCF_047901425.1`
      (and maybe keep `GCA_032062225.1` if it’s truly different and you want a more distant ingroup point)
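
    If you would rather automate the GCF-over-GCA preference, a minimal sketch building on the accession list above:

    ```bash
    # Keep one accession per assembly, preferring the RefSeq (GCF) copy.
    cat top20_mash.txt \
      | awk '{print $2}' \
      | grep -oE 'GC[AF]_[0-9]+\.[0-9]+' \
      | sort -u \
      | awk '{ id = substr($0, 5)                  # numeric part, e.g. 002291425.1
               if (substr($0, 1, 3) == "GCF") keep[id] = $0
               else if (!(id in keep))        keep[id] = $0
             } END { for (i in keep) print keep[i] }'
    ```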

Rook checkmate (king + rook vs lone king)

“Rook checkmate” usually refers to the basic mating procedure in chess endgames with one rook + one king against a lone king (in English commonly called the rook checkmate, or rook and king vs king). (Chess.com)

Below is the most reliable “recipe-style” framework for it (once you learn this, you basically cannot get lost):

Core idea of the rook mate: box the king in → bring your own king up → tighten the net and deliver mate

  1. The rook “draws the box” first. Use your rook to confine the opposing king to one region of the board (e.g., lock it into the upper half or the left half). Keep the rook at least one square away from the opposing king so it cannot simply be captured.

  2. Bring your own king up to “guard the rook”. The rook alone cannot keep shrinking the box: each time you tighten it, the opposing king can chase and attack the rook. Your king must therefore march closer, forming a “rook in front, king behind as protection” formation.

  3. Shrink the box: move the rook one line closer. Once your king is close enough to protect the rook, use the rook to cut the opposing king’s space down by another line (e.g., from “confined behind the 3rd rank” to “confined behind the 2nd rank”).

  4. Repeat 2) + 3). “King catches up → rook shrinks again”; after a few cycles the opposing king is driven to the edge or a corner.

  5. The typical final mating picture

  • The opposing king is on the edge (e.g., the a-file); your rook gives check and seals off that entire edge line;
  • Your king stands beside the rook (or on a square that protects it), so the opposing king can neither capture the rook nor escape to any square: checkmate. (Chess.com)

Two common pitfalls (where beginners most often go wrong)

  • Rook placed too close and captured: never put the rook “shoulder to shoulder” with the opposing king; keep at least one square between them.
  • Endless sideways checks without progress: giving rook checks forever without a plan can drift into a draw (50-move rule, etc.); the key is to bring your own king into the net.