Author Archives: gene_x

博士求职防割韭菜指南:那些花里胡哨的助理研究员、特聘研究员、专职研究员……到底是啥?

https://www.163.com/dy/article/G7LQR1V105349YKB.html

我国高等教育改革方兴未艾,各种名目的岗位层出不穷。

  不少博士毕业后懵懵懂懂就签了卖身契,等反应过来才追悔莫及。说实话,面对如此之多的岗位,要想理出个头绪真的很难:

  同样名称的岗位在不同学校可能完全不一样,不同名称的岗位在不同学校可能完全一样,甚至同样名称的岗位在同一个学校都会慢慢走样。

  有鉴于此,我们整理了一份不算全面的博士就业岗位说明,主要区分不同类别的博士后和高校岗位,帮助求职的博士们了解真相,有效避坑。

  01 博士后岗位

  很多人读完博士,就习惯性地想找个博后位置做一站。因为近期一些众所周知的原因,出国做博后可能比以前难了。国内的博后系统比国外复杂,不少想在国内做博后的人还有点搞不清楚。今天咱们就来捋一捋。

  我国博士后制度还不到四十年历史,最初由诺奖得主、华裔物理学家李政道倡议。1985 年 7 月,国务院批准设立博士后科研流动站。我国博士后制度正式确立。

  放眼海内外,博士后岗位是在高等学府或科研机构,或大型企业、高新技术企业、留学人员创业园或科研生产型事业单位设置的一些需要高学历人才的特殊短期职位,在招聘时专门针对博士,要求在规定的期限内从事具体的科学研究。博士后只是一种工作经历,不是专业学位也不是行政职务。

  博士后的工作期限多数为两年,少数可以延长到三年。按照国家的要求,博士后是设站单位的正式员工,但不列入编制,工作期满后必须流动出站。

  一、流动站博后

  1. 岗位简介

  正是因为博士后们在找到固定的工作岗位前实际上处于流动状态,所以最早容纳博士后进行科研的机构叫流动站。流动站是建立在高校或科研院所的某个一级学科之下的管理博士后的机构,也是我国博后群体的主流。

  2. 薪资待遇

  大部分高校为博后提供的待遇还不错,的确是向本校正式员工看齐,一般包括基本薪资、科研绩效津贴与专项经费、住房补贴与子女入托入学等。

  博士后制度建立初期,薪资曾一度非常可观,据说达到当时教授的水平。后来形势急转直下,性价比很低。这导致很大一部分中国博士出国做博后,不利于人才回流。最近十年,在多方呼吁之下,博后待遇有了飞速提高,最高 60 万,平均 30 万。

  某研究所博后招聘的支持政策

  不过,这钱可未必全是学校出。博士后有国家资助,每人两年 10 万;很多省市有专门针对博后的人才政策,给钱不说,还专门修建人才公寓,解决住房问题;学校肯定也要给一部分,解决一部分福利待遇;课题组通常还能再给点。几处相加,的确实惠。

  3. 绩效要求

  博后的工作也有绩效要求,基本都体现在出站要求上。博士后是专门从事科研的短期工作人员,无论是否在职,是否脱产,都不需要承担教学工作。

  说到底,绩效考核还是承担怎样的项目,发表多少论文。随着博士毕业人数的增加,博后岗位也加速内卷,出站要求水涨船高,有良心的学校还知道跟着提高配套的绩效,没良心的学校直接玩一手「退站就退钱」的操作,空手套白狼地割韭菜。

  4. 发展前景

  虽然博士后并非教职,但不少人还是把博士后看作学术生涯的起点,也确实有很多博士的独立科研正是从博后阶段开始。而且我国博士后由全国博士后管理委员会(博管办)统一管理,不是学位胜似学位。

  所以,博士后做得好坏与否,对一个博士的学术生涯而言还是很重要的。现实中也有许多博士阶段工作平平,博后阶段突然开窍而后谋得高校教职的例子。

  二、工作站博后

  1. 岗位简介

  说博后阶段很重要的前提,仅指流动站,不包括工作站。

  工作站是在企业、科研生产型事业单位或特殊的区域性机构内设立的管理博士后的机构,虽然 1997 年才开始推行,但以目前数量来看,比流动站还多,是不少博士毕业后的选择。

  工作站还解决了流动站无法解决的「同站同学科」难题,允许高校毕业的博士进入工作站的同一个一级学科开展研究。

  工作站博后的工作期限一般为两年,也有因研究项目进度等原因需要延长工作期限的情况。企业在用人方面操作很灵活,有的也未必是按照博后身份延期,可以按照新的劳动合同延期。

  图片来源:阿里巴巴官网

  2. 薪资待遇

  工作站博后的薪资和待遇由本单位承担,实际收入与企业同岗位、同资历工作人员的收入相当,也可以有地方政府人才政策的额外加成。在站博后也能同本单位正式职工一样享受其他各项生活福利待遇,包括奖金、公费医疗、困难补助、探亲、书报补贴等。由于能设立工作站的企业多数资质不错,也愿意拿钱留人,所以工作站博后的到手工资还是有竞争力。

  3. 绩效要求

  工作站博后也有出站要求,具体由所在单位决定。因为工作站博后通常要兼顾工作任务与学术要求,而且普遍来讲,企业的科研平台不如高校和研究所,整体的研究氛围也与高校无法相比,因此出站要求偏低。

  这就导致履历上工作站博后的含金量不高,后续求职中可能会被人另眼相看。

  4. 发展前景

  人才和学历是企业的名片,不排除有些企业设立工作站,就是为了作秀甚至骗钱,博士们去了可就惨了。

  另外还有些企业的博后管理比较落后,博后需要帮助的时候无人支援,一旦有麻烦出现,倒是推诿扯皮,磨蹭拖拉。

  更可恶的是,有些企业还把博后申请到的经费据为己有,导致博后根本无法开展研究工作。

  最惨的是,一旦这些坑人企业被查,博后工作站被撤,在站博后分分钟面临失业的境地。所以,去之前一定要考察好。

  图片来源:中国政府官网

  三、师资博后

  1. 岗位简介

  师资博后是中国特色高等教育的产物之一。

  进入新世纪,在高校扩招和高等教育飞速发展的背景下,优质师资不足日益凸显。为了提高青年教师的学术水平,把新晋教师的考察期延长,浙江大学于 2005 年率先推出「师资博士后」政策,将部分博士后纳入师资队伍管理。

  师资博后比普通博后的准入门槛高,需要同时申请博士后和应聘讲师。入站后,除了承担科研工作之外,也需要承担部分教学任务,为未来执教提前准备。

  2. 薪资待遇

  师资博后的合同一般还是两年,在站期间享有国家规定的薪资待遇和福利,少数高校会象征性给点科研启动金,但与之匹配的,出站要求则比流动站博后更高 —— 否则,凭啥留下做教师呢?

  某高校师资博后待遇,图片来源:某高校官网

  3. 绩效要求

  历史地看,当年浙大推出师资博后,属于一种对 tenure-track 的探索。本质上还是遴选优秀的博士作为教师预备队,如果绩效不达标,则两年到期后出站,不必继续占用本校教师名额。

  现在,各种明目的专职科研岗位在中华大地全面开花,师资博后再也不是建设教师队伍,提高科研绩效的最好选择。

  但不得不说的是,跟现在那些割韭菜的专职科研岗位相比,浙大还是很厚道的 —— 在推行师资博后的四年中,有大约七成的师资博后出站后成为浙大正式教师,兑现率很高。

  4. 发展前景

  当然,也有些高校的师资博后在如今内卷的大环境下变得非常坑,出站绩效要达到副教授甚至教授的水平,出站报告要以博士论文的标准来做。内卷之下,无一幸免,师资博后,留任无望,最后也免不了被割一把韭菜。

  四、国外博后

  1. 岗位简介

  跟国内的博后比起来,国外的博后就很简单了…… 除了名字之外。

  在国外,博后就是跟着项目走的短期高级雇员,只要钱足够,头衔可以随便给:

  Postdoctoral、Postdoctoral Scientist、Postdoctoral Fellow、Postdoc Associate、Postdoctoral Researcher、Postdoctoral Scholar、Postdoc Research Scientist、Research Scientist、(Senior) Research Fellow、(Senior) Research Associate……

  统统都可以是博后。

  美国疾控中心博后招聘

  2. 薪资待遇

  高级雇员也只是打短工,工作期限一般是两年,年薪多数在四五万美元。其实这都不重要,老板喜欢你,自然会加薪,重新签合同都行;老板不喜欢你,提前通知一下就能炒你鱿鱼。

  所谓的商业社会注重契约精神,前提还得是老板说了算。谁让钱都是老板给呢!

  正因如此,也就不存在实质上的博士后管理部门,什么科研启动金、人才政策、子女入托等福利更是没有 —— 国外压根儿不来这一套。

  3. 绩效要求

  因为老板说了算,考核什么的也就完全没有硬性标准,端的是「说你行你就行,不行也行;说不行就不行,行也不行」。有钱就是大爷,学术界也未能幸免啊~

  4. 发展前景

  从职业发展来看,如果能到国外大牛组里做博后还是非常好的,这可是闪闪发光的学术履历。好好表现,走的时候带走大牛一封有力的推荐信,那可是求职的杀手锏。要是还能发几篇顶刊论文,那几乎可以预定个「X 人计划」了。

  总结

  博后是科研生涯的踏板,选错博后而误终生的例子实在太多,一定要慎重选择。特别要注意协议文本,不要只关注进站的条件和待遇,更要仔细弄清出站的要求和退站的约定,免得白花自己时间精力,还惹一肚子气。

  02 高校岗位

  在谈高校岗位前,先祭出高校教师打怪升级图。

  

  图片来源:作者

  没有任何独立科研经验的博士,想拿教授是很根本不可能的,一般给个讲师,个别优秀的才能拿到副教授。现在不少高校已经取消了讲师的招聘,代之以名目繁多的不给编制的新头衔。这里面坑可不少,博士们在找工作的时候千万要注意加以分辨。

  一、讲师

  1. 岗位简介

  助教不能独立开课,讲师能够独立开设一门或一门以上的课程。高校默认博士们具有独立开课的能力,起点至少是讲师。

  2. 薪资待遇

  目前大多数高校的讲师都没有编制,合同期限长短得看协议。

  讲师的薪水都是高校给,一般底薪很低,每月小几千块 —— 这个没办法,按照国家标准来嘛 —— 但好在可以拿课时费,等于是绩效部分。不同高校的课时费多寡不同,少的有几十块的,多的有三位数的。

  讲师一般不按人才引进,但博士去做讲师,科研启动金和安家费可能都会有一点。社保转到当地的话,也能享受人才政策,政策力度各地不同。讲师是高校的正式员工,子女入托等福利还是有的,再多的就不要想了。

  也不排除有待遇很丰厚的讲师岗位,需要博士们在找工作的时候仔细寻觅。

  某高校招聘讲师提供的待遇,图片来源:某高校官网

  3. 绩效要求

  理论上讲师只要讲好课就可以,但为了打怪升级职业发展,讲师还是得搞科研,不然怎么评副教授呢?具体的职称评审要求,各个高校不同。

  需要注意的是,讲师是没办法独立建课题组的,文科的自己做研究也能将就,理工科需要做实验的话,多数会选择个大牛的组当个小老板,帮大牛带带学生,出的成果算自己一份儿,将来评职称可以用。

  二、副教授 / 副研究员

  1. 岗位简介

  少数高校按照人才引进优秀博士或博士后会直接给副教授 / 副研究员,有些还会很大方地直接给编制。这种岗位是每个想进高校的博士梦寐以求的。

  2. 薪资待遇

  如果有了编制,合同期限就只是个形式,不重要。薪水也是高校给,能比讲师多一些,课时费也比讲师高。绩效部分,因为承担科研项目,所以项目结题和发表论文都会有奖励。

  按照人才引进,科研启动金和安家费是必须得有的,不同地区不同高校在引进人才的这两笔钱上差距可能有十倍。看到别人的多,咱也别眼热。

  因为有编制,社保是必须转到当地的,自然也可以享受当地的人才政策 —— 多数还不是最低的一档。副教授 / 副研究员也是高校的正式员工,子女入托等福利一应俱全,还有其他明的暗的福利。

  某高校招聘讲师 / 副教授的公告,图片来源:某高校官网

  3. 绩效要求

  高校能给副高,肯定是希望博士们继续搞科研出成果的,所以科研绩效定得比较高。这也符合绝大多数博士们的意愿,双方利益是统一的。

  但这里有个问题,有了副高不一定是硕导,就算能独立建组,也未必能招到研究生。有的高校采取 PI 制(Principal Investigator)来协调,理论上 PI 是跟项目走的,项目的负责人就是 PI,但这还是难以解决国内研究生导师需要一定资质的问题。

  一种做法是让研究生统一挂名某个院内大牛,各个副教授 PI 完全独立带学生。当然,有的高校嫌麻烦,就让这些副教授们像讲师那样进大老板的组,但好处是搞联合培养啥的比讲师容易。

  三、特聘教授 / 副教授 or 特聘研究员 / 副研究员 or 专职研究员 / 副研究员

  1. 岗位简介

  古语有云:同进士不同进士,如夫人不如夫人。可见头衔加前缀,降低含金量。这个道理到今天也一样。

  甭管「特聘」还是「专职」,都是游离在高校岗位之外的,无非是为了对头衔做个区分,区分的重点在于:没有编制!

  某高校招聘特聘研究员 / 副研究员,图片来源:某高校官网

  2. 薪资待遇

  普遍意义上,没有编制就是临时工。对临时工而言,合同就很重要了。这些专门为研究而设立的岗位,一般是一份合同签两段相同年限的聘期,比如来个「3+3」。不要觉得六年时间还挺长,可以骑驴找马,慢慢找其他机会。要知道,某些高校的急功近利很刷下限。

  薪水方面,一般的临时工,工资是比正式员工少的,但特聘或专职岗位经常待遇还算过得去,一年怎么也得有个二三十万 —— 读了个博士,总是管点用。

  工资理论上都是学校给,但上面也说了,得看具体的协议怎么写的。但无论怎么写,都不是按照人才引进,科研启动金和安家费几乎是确定没有的,能保证纸面上的薪资不是打包的用人成本就已经很不错了。

  因为没编制,社保啥的不重要,人才政策啥的也没人替博士们张罗。虽然名曰高校正式职工,但子女入托等福利落实得比较少,个别厚道点的高校才会解决一部分。

  某高校招聘特聘研究员 / 副研究员,图片来源:某高校官网

  3. 绩效要求

  既然是高校专门设立岗位忽悠招聘来的,那目的就很明确了 —— 做科研。这类专门的岗位不能独立建组,只能找个大老板的组干活儿,通常也不必操心教学的事,就是一门心思弄项目发论文即可。在这点上,高校、课题组老板、特聘副教授们的需求是一致的。

  4. 发展前景

  照说,就这么闷头苦干多发论文,职业发展应该不错吧?那得看在哪里。在原高校想留下那是基本没机会的,但要是借着这段时间发表的论文,去其他地方求职,职业发展也有可能不错。注意!只是有可能哦~~坊间传言,有的高校已经开始不再招聘做过专职科研岗位的博士。

  四、助理教授

  1. 岗位简介

  助理教授是高校人事招聘中最近兴起的舶来品。美国的助理教授叫 Assistant Professor,比副教授(Associate Professor)和教授(Full Professor)低。在 tenure-track 体系中,是打怪升级的起点。

  引入国内后,目前有点混乱,既可以是讲师,也可以是特聘研究员,反正就是没编制。

  2. 薪资待遇

  虽说没编制,但学术起点可是相当好。助理教授薪资不错,据说看齐正教授,也有安家费和科研启动金 —— 都是学校给的。学校也会尽力申请,把地方上的人才政策落实到位。科研启动金往往很可观,多到能独立建实验室。

  前面提到的 PI 制跟助理教授的配合也比较好,比如在改革开放试验田的某高校,新引进的助理教授是 PI,也是博士生导师,能招研究生,还能招研究助理和博士后。这基本就是美国的整套玩法了。搞不好比美国还好,因为子女入托等高校福利,国外一般没有,可咱们有!

  图片来源:深圳大学官网

  3. 绩效要求

  考虑到助理教授也没编制,所以合同很重要。能干几年,什么标准能评副教授,最好都写进协议。助理教授需要承担的教学任务通常很少,安心做研究达到绩效就好。

  4. 发展前景

  从目前国内高校的情况看,助理教授还不算坑,好好干的话,留下的可能性很大。毕竟学校已经花了这么多资源为助理教授独立建组,要是聘期结束就赶人,沉没成本也忒大了点儿。但不排除某些高校在用人方面采取创新,继续把助理教授这个头衔玩儿坏。

  总结

  以上的信息,都只是大面上的粗略的介绍,跟具体高校的具体岗位肯定有出入。总结起来,高校的岗位,没编制的多,有编制的少。对于那些没编制的岗位,还要看用人单位的招聘说明。

  最后,需要提醒每一位博士的是:入职前一定要仔细看合同,对于薪资待遇、绩效要求、考核标准、晋升条件等要一再确认,最好都落实到纸面。

  —— 哪怕有些流氓高校朝令夕改,咱最少还有个依据不是?

当前博士入职高校任教究竟有多难?

我们来看一组数据:

2019年国家统计局年鉴显示,国内应届博士毕业生已经超过6万,加上海归,现在总人数突破10万是必然趋势。而普通高等学校正以每年3万~4万人的增量吸收专任教师岗位人才。这意味着有30%左右的博士将注定被挡在高校正职门外,剩下的大部分还要经过残酷、甚至惨烈的层层选拔,才有机会成为最终的“胜利者”。

面对僧多粥少、难求一职的尴尬现状,“十万博士大军”开始寻求“曲线救国”的方式,比如通过博士后这一特殊职位,先设法迈进高校的门槛,再以此为跳板,一步步挤进“正规军”行列。但事实上,如此美好的期待背后,往往隐藏着危机,稍不留神就会掉入大坑……

常见的博后类型有哪些?

关于博后的认知,门外汉还停留在“这是比博士更牛的学历”的误区里。而学术圈内的在读博士,虽然知道博后是职位,却也难以辨别其光鲜面具下的真实样貌。

在我国现有的在编高校教职序列中,博后本无姓名。换句话说,博后的存在是学校为缓和“教职编制供不应求和需要引进高层次人才”这一矛盾的应变策略。当然,不设上限数量的编外博后职位,确实也有效减缓着就业压力,同时培育着科研潜力股。

目前,博后职位逐渐衍生出以下三种类型:

  • 项目博后。顾名思义,入站就要为一个具体项目组的科学研究付出努力,待遇比普通博后略高,但项目结束之日就是你出站之时,完全没有留校机会。

  • 科研博后。类似于专职科研岗,这是高校争抢优质人才的又一手段,他们往往以高薪吸引基础扎实的博士毕业生进入学校工作,专职科研而不安排教学任务,目的就是能快出成果,冲刺“四青”。比较扎心的是,科研博后同样没有编制,3年一聘用,面临非升即走的困境。它的留校要求也非常苛刻,而且副教授名额有限,竞争压力极大。不过如果个人能力足够突出,仍然能出好成果,发好论文。某种程度上还是可以为自己积累跳槽的筹码。

  • 师资博后。这大概是博后中的“天坑职位”,所谓师资,本质上就是高校想招聘老师但是又没有编制名额,所以暂时用一部分博后指标先将博士招聘进来,教学、科研“双肩挑”,服务期大约2~3年不等。

要求时时变,待遇别期待

之所以说师资博后是“天坑”职位,不仅仅因为其需要背负“教学+科研”两座大山,更重要的是摇摆不定的政策令人头大,感觉自己就是任人宰割的小羔羊,毫无招架之力。

比如,谈及科研要求,起初承诺发1篇核心期刊论文即可拥有留校资格;没做多久,各类领导又婉转表达,必须拿到省部级课题+论文才能聘讲师;好吧,努努力也不是不可能,未曾想省部级课题申请书还没交上去,新领导上任,考核条件又提升成国家级课题+指定刊物论文。

这时你发现,这应该是该副教授的标准……那么问题来了,合同签了,其他offer推了,招聘季过了,陷入骑虎难下的处境。一怒之下闪出辞职念头,却又望着高昂的离职赔偿金,瞬间只好在心里弱弱哀叹——终究是错付了。

此外,师资博后可以说“付出和回报不成正比”。除了教学和科研,有时还要接受许多杂活儿,没有边界感;而且由于不是学校的“自己人”,福利待遇也一言难尽,基本课时费和工资或许比讲师高,但是五险一金、年终奖、节日奖等其他待遇是否一视同仁就不敢保证了。

总之,与项目博后、科研博后的明码标价不同,师资博后到处都是“灰色地带”,应届博士小白容易被看似充满希望的表象所迷惑,最后为自己的too young too simple 买单。

留校需要“硬实力+好圈子”

在高校内卷没有这么惨烈的从前,合同上所谓的“两年后经考核达到XX条件即可留校聘为讲师”的承诺通常还生效。但随着科研界通货膨胀的日益严重 ,实际上已经不可信了。大多数博士都是服务期内达不到科研要求,遗憾出局。正如某论坛上的段子所言,“两年过后,带着一身‘007’的伤痛,你独自一人黯然离去,而学校的科研GDP越来越红火”。

说到这,有些大佬可能会以切身经历站出来反驳。的确,师资博后虽然槽点多多,但也不排除“牛人趟出一条血路”的概率。毕竟,理论上师资博后的培养方向是“有编制、出成果、能授课”的满分人才。如果你满足以下几个条件,倒也不妨一试。

  • 成果数量多,代表作不多。此类博士不在少数,由于学历背景、年龄限制、海外经历等等原因,难以通过高校教职的选拔,但研究生阶段科研基础扎实,产出过一系列普通论文,同时对自己的研究方向具有深入见解和前瞻眼光。这样的潜力股保持住干劲,假以时日,确实有机会以师资博后为跳板,转正获得编制。

  • 学术人脉资源过硬。想要拥有光明的学术前景,除了刻苦努力,也讲究“圈子”支持,甚至有时候后者加持更重要。如果能跟上好老板、好导师,加上自身水平不差,肯付出被赏识,未来发展基本就稳了。

  • 做好兼顾教学和科研的心理准备。拥有超强的时间分配管理能力,既能保证高质量完成科研工作,还能抽出本就不多的休闲时间用来备课教学,精力MAX且责任心plus。

凡事无绝对,是否选择师资博后作为第一份工作,因人而异。

对于学历出身好、代表作不少的博士应届生,千万别因为眼前的诱惑而草草交待,还是耐心等待,投递教职岗位,因为出站“再就业”是大概率事件,到时候这段博后经历可能是减分项,就像“本科北大,硕士被调剂去了五道口技术学院”;对于综合实力一般的博士,可以选择海外博后历练一番,也可以在权衡利弊后入职师资博后,只要下定决心、耐住磨砺,逆袭也不是梦,就像“专科一路考上北大,特别难,但也总有人创造奇迹”。

关于师资博后,各高校情况不同,tips不一定放之四海皆准,但能有效避免小白傻傻“入坑”、追悔莫及。

尾声

博士求职的高压越来越令人窒息不假,但也千万别因此抱着“赌一把”的心态随意签下“卖身契“。求职关键期里:首先锁定教职岗位;经济条件允许的可以考虑海外博后;而一旦选择了国内师资博后等博后职位,要心中有数且想好后路!

YopQ Secretion Boxplot and Fitting Function

YopQ_Secretion_Barplot

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import seaborn as sns

# Sample data
data = {
    "Sample": ["1 (04.10.23)", "2 (27.20.23)", "3 (30.11.23)", "1 (SR003)", "2 (SR004)", "3 (SR021)"],
    "YopQ1-5": [0.17864353, 0.08980754, 0.13318754, 0.039536, 0.07075792, 0.04852611],
    "YopQ1-10": [0.3663838, 0.45519364, 0.29828942, 0.18891885, 0.38740741, 0.30086734],
    "YopQ1-15": [0.49962418, 0.96844132, 0.77290941, 0.4739692, 0.74691256, 1.22698631],
    "YopQ1-25": [0.9791465, 0.85120313, 0.97336014, 0.4107113, 1.31386354, 0.69474945],
    "YopQ1-35": [1.32707017, 1.23130563, 1.23649531, 0.68362804, 0.78614696, 1.14339001],
    "YopQ1-45": [1.42828421, 1.32843428, 1.68480714, 0.84469081, 0.79365431, 1.58619928],
    "YopQ1-55": [0.83107654, 1.24622023, 0.92185733, 1.05695509, 0.77422497, 1.28645523],
    "YopQ1-65": [1.62036726, 1.21848867, 0.94667898, 0.91017609, 1.181266, 1.2667003],
    "YopQ1-76": [1.22602291, 0.85284654, 1.15612041, 0.82651662, 1.6654101, 1.17553764],
    "YopQ1-182(wt)": [1, 1, 1, 1, 1, 1]
}

# Create DataFrame
df = pd.DataFrame(data)

# Extract columns for plotting
yopq1_columns = df.columns[1:]
x_categories = np.arange(len(yopq1_columns))  # Numeric representation for curve fitting

# Calculate means and standard deviations for the bar plot
means = df.iloc[:, 1:].mean()
stds = df.iloc[:, 1:].std()

# Create the bar plot with error bars
plt.figure(figsize=(12, 6))
plt.bar(x_categories, means, yerr=stds, color='grey', capsize=5, edgecolor="black", zorder=2)

# Add jittered points using swarmplot for visualizing individual data points
sns.swarmplot(data=df.iloc[:, 1:], color="black", size=6, zorder=3, edgecolor="white")

# Define exponential plateau function
def plateau_exponential_func(x, a, b, c):
    return a * (1 - np.exp(-b * x)) + c

# Fit data using the mean of each group
popt, _ = curve_fit(plateau_exponential_func, x_categories, means.values, p0=[1, 0.1, 0.5], maxfev=10000)

# Generate fit curve data
x_fit = np.linspace(0, len(yopq1_columns) - 1, 100)
y_fit = plateau_exponential_func(x_fit, *popt)

# Plot fitting curve
plt.plot(x_fit, y_fit, 'r-', label=f'Fitting function\n($y = {popt[0]:.2f} \\cdot (1 - e^{{-{popt[1]:.2f}x}}) + {popt[2]:.2f}$)', zorder=4)

# Update labels and styles
plt.ylabel('Relative Secretion', fontsize=12)
plt.xticks(ticks=x_categories, labels=[
    '$YopQ^{1-5}$', '$YopQ^{1-10}$', '$YopQ^{1-15}$', '$YopQ^{1-25}$',
    '$YopQ^{1-35}$', '$YopQ^{1-45}$', '$YopQ^{1-55}$', '$YopQ^{1-65}$',
    '$YopQ^{1-76}$', '$YopQ^{1-182}(wt)$'
], rotation=15)

# Grid and legend
plt.grid(True, axis='y', linestyle='--', linewidth=0.7, zorder=0)
plt.legend(title='', loc='upper left')
plt.tight_layout()
plt.subplots_adjust(left=0.1)
plt.show()

## Save plot with high resolution
#plt.savefig("high_resolution_plot.png", dpi=300)  # Save with 300 DPI

YopQ_Secretion_Boxplot

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import seaborn as sns

#Explanation of Key Parts:
#
#    sns.swarmplot: This function automatically adds jitter to the points, ensuring they don't overlap. By setting color="black", the points will be displayed in black, and size=5 sets the size of the points.
#
#    Boxplot: The sns.boxplot is still used to display the boxplot for the data distribution in grey (color="darkgrey").
#
#    Fitting Curve: The exponential curve fitting remains the same as before.
#
#    Legends: The legend for the points has been removed, and only the fitting function legend is shown.
#
#Benefits of Using swarmplot:
#
#    Automatic Jittering: It provides a nice way to scatter points without overlap, as it automatically adjusts the points’ positions along the x-axis.
#    Combining with Boxplot: You can easily combine it with the boxplot for showing both the distribution and the individual points.

# 数据集
data = {
    "Sample": ["1 (04.10.23)", "2 (27.20.23)", "3 (30.11.23)", "1 (SR003)", "2 (SR004)", "3 (SR021)"],
    "YopQ1-5": [0.17864353, 0.08980754, 0.13318754, 0.039536, 0.07075792, 0.04852611],
    "YopQ1-10": [0.3663838, 0.45519364, 0.29828942, 0.18891885, 0.38740741, 0.30086734],
    "YopQ1-15": [0.49962418, 0.96844132, 0.77290941, 0.4739692, 0.74691256, 1.22698631],
    "YopQ1-25": [0.9791465, 0.85120313, 0.97336014, 0.4107113, 1.31386354, 0.69474945],
    "YopQ1-35": [1.32707017, 1.23130563, 1.23649531, 0.68362804, 0.78614696, 1.14339001],
    "YopQ1-45": [1.42828421, 1.32843428, 1.68480714, 0.84469081, 0.79365431, 1.58619928],
    "YopQ1-55": [0.83107654, 1.24622023, 0.92185733, 1.05695509, 0.77422497, 1.28645523],
    "YopQ1-65": [1.62036726, 1.21848867, 0.94667898, 0.91017609, 1.181266, 1.2667003],
    "YopQ1-76": [1.22602291, 0.85284654, 1.15612041, 0.82651662, 1.6654101, 1.17553764],
    "YopQ1-182(wt)": [1, 1, 1, 1, 1, 1]
}

# 创建DataFrame
df = pd.DataFrame(data)

# 抽取YopQ1列
yopq1_columns = df.columns[1:]
x_categories = np.arange(len(yopq1_columns))  # 数值化类别索引

# 创建箱线图(灰色)
plt.figure(figsize=(12, 6))
sns.boxplot(data=df.iloc[:, 1:], color="darkgrey", zorder=2)  # Boxplot in dark grey

# Add jittered points using swarmplot (black points)
sns.swarmplot(data=df.iloc[:, 1:], color="black", size=6, zorder=3)  # Swarmplot adds jitter

# 定义指数趋近函数
def plateau_exponential_func(x, a, b, c):
    return a * (1 - np.exp(-b * x)) + c

# 计算每个类别中值并进行拟合
y_data = df.iloc[:, 1:].median().values

# 拟合指数趋近模型
popt, pcov = curve_fit(plateau_exponential_func, x_categories, y_data, p0=[1, 0.1, 0.5], maxfev=10000)
a, b, c = popt

# 生成拟合曲线数据
x_fit = np.linspace(0, len(yopq1_columns) - 1, 100)
y_fit = plateau_exponential_func(x_fit, a, b, c)

# 绘制拟合曲线
plt.plot(x_fit, y_fit, 'r-', label=f'Fitting function\n($y = {a:.2f} \\cdot (1 - e^{{-{b:.2f}x}}) + {c:.2f}$)', zorder=4)

# 更新轴标签
plt.xlabel('')   #YopQ1 (Categories)
plt.ylabel('Relative Secretion', fontsize=12)
plt.title('')    #Boxplot of YopQ1 Measurements with Plateau Exponential Fit and Raw Data

# 显示类别标签
labels = ['$YopQ^{1-5}$', '$YopQ^{1-10}$', '$YopQ^{1-15}$', '$YopQ^{1-25}$', '$YopQ^{1-35}$', '$YopQ^{1-45}$', '$YopQ^{1-55}$', '$YopQ^{1-65}$', '$YopQ^{1-76}$', '$YopQ^{1-182}(wt)$']
plt.xticks(ticks=x_categories, labels=labels, rotation=15)

# Remove the legend for the points, keep for fitting function
plt.legend(title='', loc='upper left')  # Only keep the legend for the fitting function

plt.grid(True)
plt.tight_layout()
plt.subplots_adjust(left=0.1)  # Increase the left margin to avoid cutting off ylabel
plt.show()

## Create a figure with specified size
#fig = plt.figure(figsize=(12, 9))  # (12, 9) corresponds to 1200x900 in pixels at 100 dpi
## Your plotting code goes here (e.g., boxplot, scatter plot, etc.)
## Save as PNG
#fig.savefig("plot.png", dpi=100)  # The dpi is 100, so figsize=(12, 9) will be 1200x900 pixels
## Save as SVG
#fig.savefig("plot.svg")
## Close the plot to release resources
#plt.close(fig)

RNA-seq Tam on CP059040.1 (Acinetobacter baumannii strain ATCC 19606)

nf-core_rnaseq_on_CP059040

http://xgenes.com/article/article-content/209/rna-seq-skin-organoids-on-grch38-chrhsv1-final/ http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/

Methods

Data was processed using nf-core/rnaseq v3.12.0 (doi: https://doi.org/10.5281/zenodo.1400710) of the nf-core collection of workflows (Ewels et al., 2020).

The pipeline was executed with Nextflow v22.10.5 (Di Tommaso et al., 2017) with the following command:

nextflow run rnaseq/main.nf –input samplesheet.csv –outdir results –fasta /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta –gff /home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff -profile docker -resume –max_cpus 55 –max_memory 512.GB –max_time 2400.h –save_align_intermeds –save_unaligned –save_reference –aligner star_salmon –gtf_group_features gene_id –gtf_extra_attributes gene_name –featurecounts_group_type gene_biotype –featurecounts_feature_type transcript

  1. prepare reference

    They are wildtype strains grown in different medium.
    AUM - artificial urine medium
    Urine - human urine
    MHB - Mueller-Hinton broth
    
    mkdir raw_data; cd raw_data
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_1.fq.gz AUM_r1_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-1/AUM-1_2.fq.gz AUM_r1_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_1.fq.gz AUM_r2_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-2/AUM-2_2.fq.gz AUM_r2_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_1.fq.gz AUM_r3_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/AUM-3/AUM-3_2.fq.gz AUM_r3_R2.fq.gz
    
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_1.fq.gz MHB_r1_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-1/MHB-1_2.fq.gz MHB_r1_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_1.fq.gz MHB_r2_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-2/MHB-2_2.fq.gz MHB_r2_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_1.fq.gz MHB_r3_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/MHB-3/MHB-3_2.fq.gz MHB_r3_R2.fq.gz
    
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_1.fq.gz Urine_r1_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-1/Urine-1_2.fq.gz Urine_r1_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_1.fq.gz Urine_r2_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-2/Urine-2_2.fq.gz Urine_r2_R2.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_1.fq.gz Urine_r3_R1.fq.gz
    ln -s ../X101SC24105589-Z01-J001/01.RawData/Urine-3/Urine-3_2.fq.gz Urine_r3_R2.fq.gz
  2. (Optional) using trinity to find the most closely reference

    Trinity --seqType fq --max_memory 50G --left trimmed/wt_r1_R1.fastq.gz  --right trimmed/wt_r1_R2.fastq.gz --CPU 12
    
    #https://www.genome.jp/kegg/tables/br08606.html#prok
    acb     KGB     Acinetobacter baumannii ATCC 17978  2007    GenBank
    abm     KGB     Acinetobacter baumannii SDF     2008    GenBank
    aby     KGB     Acinetobacter baumannii AYE     2008    GenBank
    abc     KGB     Acinetobacter baumannii ACICU   2008    GenBank
    abn     KGB     Acinetobacter baumannii AB0057  2008    GenBank
    abb     KGB     Acinetobacter baumannii AB307-0294  2008    GenBank
    abx     KGB     Acinetobacter baumannii 1656-2  2012    GenBank
    abz     KGB     Acinetobacter baumannii MDR-ZJ06    2012    GenBank
    abr     KGB     Acinetobacter baumannii MDR-TJ  2012    GenBank
    abd     KGB     Acinetobacter baumannii TCDC-AB0715     2012    GenBank
    abh     KGB     Acinetobacter baumannii TYTH-1  2012    GenBank
    abad    KGB     Acinetobacter baumannii D1279779    2013    GenBank
    abj     KGB     Acinetobacter baumannii BJAB07104   2013    GenBank
    abab    KGB     Acinetobacter baumannii BJAB0715    2013    GenBank
    abaj    KGB     Acinetobacter baumannii BJAB0868    2013    GenBank
    abaz    KGB     Acinetobacter baumannii ZW85-1  2013    GenBank
    abk     KGB     Acinetobacter baumannii AbH12O-A2   2014    GenBank
    abau    KGB     Acinetobacter baumannii AB030   2014    GenBank
    abaa    KGB     Acinetobacter baumannii AB031   2014    GenBank
    abw     KGB     Acinetobacter baumannii AC29    2014    GenBank
    abal    KGB     Acinetobacter baumannii LAC-4   2015    GenBank
  3. Downloading CP059040.fasta and CP059040.gff from GenBank

  4. (Optional) Preparing CP059040.fasta, CP059040_gene.gff3 and CP059040.bed

    #Reference genome: https://www.ncbi.nlm.nih.gov/nuccore/CP059040
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.fasta .     # Elements (Anna C.arnes)
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gff3 .
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040_gene.gtf .
    cp /media/jhuang/Elements2/Data_Tam_RNASeq3/CP059040.bed .
    rsync -a -P CP059040.fasta jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    rsync -a -P CP059040_gene.gff3 jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    rsync -a -P CP059040.bed jhuang@hamm:~/DATA/Data_Tam_RNAseq_2024/
    (base) jhuang@WS-2290C:/media/jhuang/Elements2/Data_Tam_RNASeq3$ find . -name "CP059040*"
    ./CP059040.fasta
    ./CP059040.bed
    ./CP059040.gb
    ./CP059040.gff3
    ./CP059040.gff3_backup
    ./CP059040_full.gb
    ./CP059040_gene.gff3
    ./CP059040_gene.gtf
    ./CP059040_gene_old.gff3
    ./CP059040_rRNA.gff3
    ./CP059040_rRNA_v.gff3
    
    # ---- REF: Acinetobacter baumannii ATCC 17978 (DEBUG, gene_name failed) ----
    #gffread -E -F -T GCA_000015425.1_ASM1542v1_genomic.gff -o GCA_000015425.1_ASM1542v1_genomic.gtf_
    #grep "CDS" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf
    #sed -i -e "s/\tCDS\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf
    #gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed
    
    grep "locus_tag" GCA_000015425.1_ASM1542v1_genomic.gtf_ > GCA_000015425.1_ASM1542v1_genomic.gtf
    sed -i -e "s/\ttranscript\t/\texon\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf # or using fc_count_type=transcript
    sed -i -e "s/\tgene_name\t/\tName\t/g" GCA_000015425.1_ASM1542v1_genomic.gtf
    gffread -E -F --bed GCA_000015425.1_ASM1542v1_genomic.gtf -o GCA_000015425.1_ASM1542v1_genomic.bed
    #grep "gene_name" GCA_000015425.1_ASM1542v1_genomic.gtf | wc -l  #69=3887-3803
    
    cp CP059040.gff3 CP059040_backup.gff3
    sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3
    grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3
    sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3
    
    #3796-3754=42--> they are pseudogene since grep "pseudogene" CP059040.gff3 | wc -l = 42
    # --------------------------------------------------------------------------------------------------------------------------------------------------
    # ---------- PREPARING gff3 file including gene_biotype=protein_coding+gene_biotype=tRNA = total(3754)) and gene_biotype=pseudogene(42) ------------
    cp CP059040.gff3 CP059040_backup.gff3
    sed -i -e "s/\tGenbank\tgene\t/\tGenbank_gene\t/g" CP059040.gff3
    grep "Genbank_gene" CP059040.gff3 > CP059040_gene.gff3
    sed -i -e "s/\tGenbank_gene\t/\tGenbank\tgene\t/g" CP059040_gene.gff3
    grep "gene_biotype=pseudogene" CP059040.gff3_backup >> CP059040_gene.gff3    #-->3796
    
    #The whole point of the GTF format was to standardise certain aspects that are left open in GFF. Hence, there are many different valid ways to encode the same information in a valid GFF format, and any parser or converter needs to be written specifically for the choices the author of the GFF file made. For example, a GTF file requires the gene ID attribute to be called "gene_id", while in GFF files, it may be "ID", "Gene", something different, or completely missing.
    # from gff3 to gtf
    sed -i -e "s/\tID=gene-/\tgene_id \"/g" CP059040_gene.gtf
    sed -i -e "s/;/\"; /g" CP059040_gene.gtf
    sed -i -e "s/=/=\"/g" CP059040_gene.gtf
    
    #sed -i -e "s/\n/\"\n/g" CP059040_gene.gtf
    #using editor instead!
    
    #The following is GTF-format.
    CP000521.1      Genbank exon    95      1492    .       +       .       transcript_id "gene0"; gene_id "gene0"; Name "A1S_0001"; gbkey "Gene"; gene_biotype "protein_coding"; locus_tag "A1S_0001";
    
    #NZ_MJHA01000001.1       RefSeq  region  1       8663    .       +       .       ID=id0;Dbxref=taxon:575584;Name=unnamed1;collected-by=IG Schaub;collection-date=1948;country=USA: Vancouver;culture-collection=ATCC:19606;gbkey=Src;genome=plasmid;isolation-source=urine;lat-lon=37.53 N 75.4 W;map=unlocalized;mol_type=genomic DNA;nat-host=Homo sapiens;plasmid-name=unnamed1;strain=ATCC 19606;type-material=type strain of Acinetobacter baumannii
    #NZ_MJHA01000001.1       RefSeq  gene    228     746     .       -       .       ID=gene0;Name=BIT33_RS00005;gbkey=Gene;gene_biotype=protein_coding;locus_tag=BIT33_RS00005;old_locus_tag=BIT33_18795
    #NZ_MJHA01000001.1       Protein Homology        CDS     228     746     .       -       0       ID=cds0;Parent=gene0;Dbxref=Genbank:WP_000839337.1;Name=WP_000839337.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_000839337.1;product=hypothetical protein;protein_id=WP_000839337.1;transl_table=11
    
    ##gff-version 3
    ##sequence-region CP059040.1 1 3980852
    ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=470
    
    gffread -E -F --bed CP059040.gff3 -o CP059040.bed    #-->3796
    ##prepare the GTF-format (see above) --> ERROR! ----> using CP059040.gff3
    ##stringtie adeIJ.abx_r1.sorted.bam -o adeIJ.abx_r1.sorted_transcripts.gtf -v -G /media/jhuang/Elements/Data_Tam_RNASeq3/CP059040.gff3 -A adeIJ.abx_r1.sorted.gene_abund.txt -C adeIJ.abx_r1.sorted.bam.cov_refs.gtf -e -b adeIJ.abx_r1.sorted_ballgown
    #[01/21 10:57:46] Loading reference annotation (guides)..
    #GFF warning: merging adjacent/overlapping segments of gene-H0N29_00815 on CP059040.1 (179715-179786, 179788-180810)
    #[01/21 10:57:46] 3796 reference transcripts loaded.
    #Default stack size for threads: 8388608
    #WARNING: no reference transcripts found for genomic sequence "gi|1906906720|gb|CP059040.1|"! (mismatched reference names?)
    #WARNING: no reference transcripts were found for the genomic sequences where reads were mapped!
    #Please make sure the -G annotation file uses the same naming convention for the genome sequences.
    #[01/21 10:58:30] All threads finished.
    
    #  ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
    #  The specified gene identifier attribute is 'Name'
    #  An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA'
    #  The program has to termin
    
    #  ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
    #  The specified gene identifier attribute is 'gene_biotype'
    #  An example of attributes included in your GTF annotation is 'ID=exon-H0N29_00075-1;Parent=rna-H0N29_00075;gbkey=rRNA;locus_tag=H0N29_00075;product=16S ribosomal RNA'
    #  The program has to terminate.
    
    #grep "ID=cds-" CP059040.gff3 | wc -l
    #grep "ID=exon-" CP059040.gff3 | wc -l
    #grep "ID=gene-" CP059040.gff3 | wc -l   #the same as H0N29_18980/5=3796
    grep "gbkey=" CP059040.gff3 | wc -l  7695
    grep "ID=id-" CP059040.gff3 | wc -l  5
    grep "locus_tag=" CP059040.gff3 | wc -l    7689
    #...
    cds   3701                             locus_tag=xxxx, no gene_biotype
    exon   96                              locus_tag=xxxx, no gene_biotype
    gene   3796                            locus_tag=xxxx, gene_biotype=xxxx,
    id  (riboswitch+direct_repeat,5)       both no --> ignoring them!!  # grep "ID=id-" CP059040.gff3
    rna    96                              locus_tag=xxxx, no gene_biotype
    ------------------
        7694
    
    cp CP059040.gff3_backup CP059040.gff3
    grep "^##" CP059040.gff3 > CP059040_gene.gff3
    grep "ID=gene" CP059040.gff3 >> CP059040_gene.gff3
    #!!!!VERY_IMPORTANT!!!!: change type '\tCDS\t' to '\texon\t'!
    sed -i -e "s/\tgene\t/\texon\t/g" CP059040_gene.gff3
  5. Preparing the directory trimmed

    mkdir trimmed trimmed_unpaired;
    for sample_id in AUM_r1 AUM_r2 AUM_r3 Urine_r1 Urine_r2 Urine_r3 MHB_r1 MHB_r2 MHB_r3; do \
    for sample_id in MHB_r1 MHB_r2 MHB_r3; do \
            java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 100 raw_data/${sample_id}_R1.fq.gz raw_data/${sample_id}_R2.fq.gz trimmed/${sample_id}_R1.fq.gz trimmed_unpaired/${sample_id}_R1.fq.gz trimmed/${sample_id}_R2.fq.gz trimmed_unpaired/${sample_id}_R2.fq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20; done 2> trimmomatic_pe.log;
    done
  6. Preparing samplesheet.csv

    sample,fastq_1,fastq_2,strandedness
    AUM_r1,AUM_r1_R1.fq.gz,AUM_r1_R2.fq.gz,auto
    AUM_r2,AUM_r2_R1.fq.gz,AUM_r2_R2.fq.gz,auto
    AUM_r3,AUM_r3_R1.fq.gz,AUM_r3_R2.fq.gz,auto
    MHB_r1,MHB_r1_R1.fq.gz,MHB_r1_R2.fq.gz,auto
    MHB_r2,MHB_r2_R1.fq.gz,MHB_r2_R2.fq.gz,auto
    MHB_r3,MHB_r3_R1.fq.gz,MHB_r3_R2.fq.gz,auto
    Urine_r1,Urine_r1_R1.fq.gz,Urine_r1_R2.fq.gz,auto
    Urine_r2,Urine_r2_R1.fq.gz,Urine_r2_R2.fq.gz,auto
    Urine_r3,Urine_r3_R1.fq.gz,Urine_r3_R2.fq.gz,auto
  7. nextflow run

    #Example1: http://xgenes.com/article/article-content/157/prepare-virus-gtf-for-nextflow-run/
    
    # ---- SUCCESSFUL with directly downloaded gff3 and fasta from NCBI using docker ----
    docker pull nfcore/rnaseq
    ln -s /home/jhuang/Tools/nf-core-rnaseq-3.12.0/ rnaseq
    (host_env) jhuang@WS-2290C:~/DATA/Data_Tam_RNAseq_2024$ /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    /usr/local/bin/nextflow run rnaseq/main.nf --input samplesheet.csv --outdir results    --fasta "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.fasta" --gff "/home/jhuang/DATA/Data_Tam_RNAseq_2024/CP059040.gff"        -profile docker -resume  --max_cpus 55 --max_memory 512.GB --max_time 2400.h    --save_align_intermeds --save_unaligned --save_reference    --aligner 'star_salmon'    --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'transcript'
    
    # -- DEBUG_1 --
    #After checking the record (see below) in results/genome/CP059040.gtf, we have to change 'exon' to 'transcript', the default values are --gtf_group_features 'gene_id'  --gtf_extra_attributes 'gene_name' --featurecounts_group_type 'gene_biotype' --featurecounts_feature_type 'exon'
    #In ./results/genome/CP059040.gtf e.g. "CP059040.1      Genbank transcript      1       1398    .       +       .       transcript_id "gene-H0N29_00005"; gene_id "gene-H0N29_00005"; gene_name "dnaA"; Name "dnaA"; gbkey "Gene"; gene "dnaA"; gene_biotype "protein_coding"; locus_tag "H0N29_00005";"

Function of Small basic protein (Sbp) in Staphylococcus epidermidis biofilm matrix assembly: molecular mechanisms and spatio-temporal patterning

https://tu-dresden.de/mn/biologie/allgemeine_mikrobiologie/spp2389/research-groups/function-of-small-basic-protein-sbp-in-staphylococcus-epidermidis-biofilm-matrix-assembly-molecular-mechanisms-and-spatio-temporal-patterning

  • The assembly of multi-layered bacterial biofilm architectures is a prime example in which bacterial single cell traits are integrated and translated into population behaviour, which in turn provides cues for individual cells to prevail in changing potentially harsh environments.

  • A hallmark of biofilm formation is the production of an extracellular matrix, and in fact, biofilm matrix production and its functional consequences are outstanding examples of emergent functions of bacterial multicellularity.

  • The biofilm matrix consists of a plethora of various biomolecules, referred to as the “matrixome”.

  • Recently, we have identified 18 kDa Small basic protein (Sbp) from S. epidermidis biofilm matrix preparations.

  • Sbp accumulates at the biofilm – substratum interface and in the intercellular space and thereby functionally supports biofilm formation.

  • In addition, a sbp knock-mutant exhibits striking stationary growth phase defects and increased susceptibility against glycopeptides, both phenotypes being consistent with stringent response gene expression patterns as identified in transcriptome analysis.

  • The available evidence supports the hypothesis that Sbp matrix assembly occurs along temporally and spatially orchestrated events, providing an environment which fundamentally affects bacterial physiology.

  • We will use high resolution imaging in combination with biochemical approaches and single cell techniques to dissect principles of Sbp biofilm matrix assembly, identify and characterize involved interaction networks, and to provide insights into the importance of Sbp biofilm matrices for staphylococcal physiology.

  • In the overarching context, the project will provide molecular insights into general principles of dynamics driving bacterial multicellularity, and their inter-relation with bacterial physiology on population and single cell levels.

Cluster Analysis and Quantification of Segmented Volumes from IMARIS 3D Segmentation Data

    % the goal of this script is to determine cluster information starting from
    % the results extracted from IMARIS 3D segmentation 
    clear all

    clast_par=1.3; 
    % filename = '20241107 1457 18h_0002-1.tif DAPI.xls';
    % filename = '202406 1457dsbp pRB473_1.xls';
    % filename = '1457 1h 2.xls';
    %filename = '20231214 1457dsbp pRB473 sbp alfa 508_003-1.xls';
    filename = '20240321 1457dsbppRB473sbp ALFA508 0 1.xls';
    %filename = '202406 1457dsbp pRB473_1.xls';
    %The only two input necessary are the name of the file to be analized i.e file name and parameter that control how far away two
    % segmented volumes should be. clast_par < 1 means the distance between two segmented volume should be inferior to the sum of their mean radius
    % clast_par =1 the distance between two segmented volumes should be equal to the sum of their mean radius
    %  clast_par >1 the distance between two segmented volumes should be bigger than the sum of their mean radius

    % the data to be quantified is stored in an excel multipage file

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% segmented volumes radius determination
    %read_volume=xlsread(filename,32);% opens page where the volumes are listed and extracts the volume information for each cluster
    read_volume=readtable(filename, 'Sheet', 32);
    siz_point=size(read_volume);     %it extracts the information about how many volumes were quantified 
    radius_nn(1:siz_point(1))=0;     %it creates a vector that will contain the estimated radius of all the detected volumes
    %radius_nn=1000*( read_volume(:,1)*3/(4*pi)).^(1/3);
    radius_nn = 1000 * (read_volume.Volume * 3 / (4 * pi)).^(1 / 3);
    % the radius is calculated approximating the quantified volumes with be the volumes of a sphere
    Mean_radius_nm = mean(radius_nn)   % displays the mean radius of the segmented volumes
    %please note that the radius of the segmented volumes is a good
    %approximation of the full width at half maximum of the fluorescence object
    Variance_radius_n = std(radius_nn)  % displays the variance of the radius of the segmented volumes
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% determination of
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% the distance between the segmented volumes and sorting them between volumes belonging to the same cluster and volumes not belonging to the same cluster.
    % center_of_mass=xlsread(filename,5);% open page where the coordinate of the center of mass of the segmented volumes are listed
    % 
    % total_num=size(center_of_mass); % again number of detected spots just to check if it match with the previous values (not necessary)
    % total_resolved_spots = total_num(1) % displays the total number of volumes quantified
    % %center_of_mass=xlsread(filename,5); % open the page of the file where the values of the centre of mass of the detected volumes are stored and write them into a matrix
    % distances(1:siz_point(1),1:siz_point(1))=0;% produce the matrix were the distances between detected volumes will be stored
    % cluster(1:siz_point(1),1:siz_point(1))=0;%produce a matrix where will be stored the information wether two segmented volumes are close together (distance fulfil cluster parameter) cluster (i,j)=1
    % %or not cluster(i,j)=0;
    % 
    % %%calculates the matrix of the distance between segmented volumes i and j as the distance between their centre of mass
    % %the factor 1000 converts the distance from micronmeter to nanometer
    % for i=1:siz_point(1)
    % for j=1:siz_point(1)
    % 
    % distances(i,j)= sqrt( (center_of_mass(i,1)-center_of_mass(j,1))^2 + (center_of_mass(i,2)-center_of_mass(j,2))^2 + (center_of_mass(i,3)-center_of_mass(j,3))^2 )*1000;
    % end
    % end
    % 
    % %Here is determined if two volumes are clustering together i.e. if their distance <= the sum of their radius multiplied for clust_par defined above, or not
    % for i=1:siz_point(1)
    % for j=1:siz_point(1)
    % 
    % if i==j
    %     cluster(i,j)=0;
    % else
    %     if  distances(i,j)<=clast_par*(radius_nn(i)+radius_nn(j))
    %     cluster(i,j)=1;
    %     end
    % end
    % 
    % end
    % end
    % cluster3=cluster; %%%here the cluster matrix is doubled for keep avaliable its information at the end of the script

    % Read the data from Sheet 5 (Center of Homogeneous Mass)
    center_of_mass = readtable(filename, 'Sheet', 5);

    % Display the modified variable names (these are the names used by MATLAB)
    disp('Modified Variable Names:');
    disp(center_of_mass.Properties.VariableNames);

    % Display the original column headers (as stored in the Excel file)
    disp('Original Column Names:');
    disp(center_of_mass.Properties.VariableDescriptions);

    % Define the starting row (based on the repeating headers)
    %start_row = 3; % Adjust this according to your actual data structure

    % Read the table, starting from the correct row
    %center_of_mass = readtable(filename, 'Sheet', 5, 'Range', sprintf('A%d', start_row));

    % Optional: Remove rows that are still unwanted (if any)
    % For example, if there are rows with unwanted headers in the middle of the data:
    %center_of_mass = center_of_mass(~ismember(center_of_mass{:, 1}, {'CenterofHomogeneousMass'}), :);
    %center_of_mass = center_of_mass(~ismember(center_of_mass{:, 1}, {'CenterofHomogeneousMassX'}), :);

    % Use the modified column names to extract the data
    x_coords = center_of_mass{:, 'CenterOfHomogeneousMass'};  % Extract X coordinates
    y_coords = center_of_mass{:, 'Var2'};  % Adjust as needed for Y coordinates
    z_coords = center_of_mass{:, 'Var3'};  % Adjust as needed for Z coordinates

    % Determine the number of detected spots (volumes)
    siz_point = size(center_of_mass);  % Size of the table, number of rows (spots)
    total_resolved_spots = siz_point(1);  % Total number of volumes quantified
    disp(['Total number of volumes quantified: ', num2str(total_resolved_spots)])

    % Create a matrix to store the distances between detected volumes
    disp('Initializing distance matrix...');
    distances = zeros(total_resolved_spots, total_resolved_spots);  % Initialize distance matrix

    % Create a matrix to store cluster information (1 for close volumes, 0 for far volumes)
    disp('Initializing cluster matrix...');
    cluster = zeros(total_resolved_spots, total_resolved_spots);  % Initialize cluster matrix

    % Loop to calculate the distance matrix between detected volumes
    % The factor 1000 converts the distance from micrometers to nanometers
    disp('Calculating distance matrix...');
    for i = 1:total_resolved_spots
        for j = 1:total_resolved_spots
            distances(i,j) = sqrt( (x_coords(i) - x_coords(j))^2 + (y_coords(i) - y_coords(j))^2 + (z_coords(i) - z_coords(j))^2 ) * 1000;  % Calculate distance in nm
        end
    end

    % Loop to calculate whether two volumes belong to the same cluster
    % Check if distance between volumes is less than or equal to the sum of their radii
    disp('Determining cluster assignments...');
    % ERROR_NAME_SHOULD_BE_clast_par.    clust_par = 1.5;  % Define your clustering parameter (this is an example, adjust as necessary)
    for i = 1:total_resolved_spots
        for j = 1:total_resolved_spots
            if i == j
                cluster(i,j) = 0;  % No clustering with itself
            else
                if distances(i,j) <= clast_par * (radius_nn(i) + radius_nn(j))  % Check if the distance is less than the sum of radii
                    cluster(i,j) = 1;  % Assign to the same cluster
                end
            end
        end
    end

    cluster3 = cluster;  % Store the cluster matrix for further use
    %disp('Cluster Matrix (1 indicates volumes in the same cluster, 0 otherwise):');
    %disp(cluster3);

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % determination of the number of clusters the cluster size i.e. how many
    % volumes are contained into one cluster and other cluster information

    numclast=0;
    numclast2(1:siz_point(1),1)=1;
    numclast3(1:siz_point(1),1:siz_point(1))=0;
    num=0; % initial number of clusters
    cluster_record(1:siz_point(1),1:siz_point(1))=0;% matrix where the label of the cluster members should be stored i.e the label
    %of the volumes of cluster members. Please note that the column number of cluster_record matrix is also the label of the first volume find for a given
    %cluster if this column is not made only by zero values

    % searching and isolating clusters
    for i=1:siz_point(1)
        count=0;  % how many volumes are clustering with an individual volume
        count3=0; % how many volumes are belonging to the same cluster
        if sum( cluster(i,:))>=1
        num=num+1;
        %     ind= sum( cluster(i,:));

            for j=1:siz_point(1)
                if  cluster(i,j)==1
                    count=count+1;
                    count3=count3+1;
                    cluster(i,j)=0;% every time if a point on the cluster matrix is found to belong to a specific cluster it is set to
                    %zero in order to not choose this point twice. At the end of the cluster search the cluster matrix should be a full zeros matrix
                    cluster(j,i)=0;% everytime if a point on the cluster matrix is found to belong to a specific cluster is set to
                    %zero in order to the choose this point twice at the end the cluster matrix should be a full zeros matrix
                    vect(count)=j;% here are listed the label of the volumes clastering with the volume under analysis
                    cluster_record(count3,i)=j;
                end
            end
        end

        while count > 0;%%%% in this recursive loop are searched the volumes that were clustering with the first one and in case
            %they are found clustering with other volumes these are stored and this cycle is repeated until when no more clustering volume are found
           count2=count;
           count=0;
           vect2=vect;% saving the information contained in vect (is a variable size vector so overwriting is not to suggest)

           clear vect % delete this variable in a way it can be defined newly in the following part of the loop

           for ss=1:count2
                for j=1:siz_point(1)
                if  cluster(vect2(ss),j)==1
                 count=count+1;

                 cluster(vect2(ss),j)=0;
                 cluster(j,vect2(ss))=0;
                vect(count)=j;
                ll=0; %%ll=0 a volume is for the first time found to be member of a given cluster; ll=1 the volume was already found to be part of the cluster
                % and it will be not counted twice
                     for ms=1:count3
                         if cluster_record(ms,i)==j
                             ll=1;
                         end
                     end
                  if ll==0
                count3=count3+1;
                cluster_record(count3,i)=j;
                  end
                  ll=0;

                end
                end
            end
            clear vect2 %  this variable vector will be redefined at the beginning of the loop
            count;
        end

        if num > 0 % if a cluster is found
            if count3>0
                numb(num)=count3+1;% 1 comes from the fact that a cluster is always found starting from a volume that is not included
                %in the counting routine
            end
        end

    end
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%dysplaying the information
    sit=size(numb);
    number_of_cluster=sit(2)% the number of columns of the variable numb gives the total number of clusters
    total_number_of_spot_in_cluster= sum(numb)% the total number of volumes belonging to the one of the clusters (it must be inferior to the total number of volumes quantified!!!)
    average_number_of_spot_in_cluster= mean(numb)%  mean number of volumes belonging to a cluster
    dispersion_number_of_spot_in_cluster= std(numb)% variance of the number of volumes belonging to a cluster
    frequent_number_of_spot_in_cluster= median(numb)% median of the number of volumes belonging to a cluster

    indd = 0;

    % Preallocate dist and disti arrays with a sensible size limit
    max_cluster_size = siz_point(1);  % Max number of volumes per cluster, assuming worst case

    % Preallocate clust_info for efficiency
    clust_info = zeros(siz_point(1), 8); % Preallocate space for cluster info

    for l = 1:siz_point(1)

        % Display progress
        fprintf('Processing cluster %d of %d\n', l, siz_point(1));

        % Initialize variables
        cluster_member = [];
        if sum(cluster_record(:, l)) >= 1
            indd = indd + 1;
            cluster_member = [l; cluster_record(cluster_record(:, l) >= 1, l)]; % All labels in the cluster
        end

        if ~isempty(cluster_member)
            got = numel(cluster_member); % Number of volumes in the cluster
            rel = 0; % To keep track of valid distances
            xx = 0; yy = 0; zz = 0;

            % Sum x, y, z positions (in nm, factor 1000)
            cluster_coords = [center_of_mass{cluster_member, 'CenterOfHomogeneousMass'}, ...
                              center_of_mass{cluster_member, 'Var2'}, ...
                              center_of_mass{cluster_member, 'Var3'}] * 1000; % Coordinates in nm

            xx = sum(cluster_coords(:, 1));
            yy = sum(cluster_coords(:, 2));
            zz = sum(cluster_coords(:, 3));

            % Vectorized distance calculation between all points in the cluster
            % We calculate the pairwise distance matrix for all points in the cluster
            dist_matrix = pdist2(cluster_coords, cluster_coords); % pdist2 is highly optimized

            % Extract non-zero distances and store them in disti
            disti = dist_matrix(dist_matrix > 0); % Convert to vector and remove zeros

            % Collect statistical information
            clust_info(indd, 1) = numb(indd); % Number of volumes in the cluster
            clust_info(indd, 2) = 0.5 * mean(disti); % Mean distance between volumes
            clust_info(indd, 3) = std(disti / 2); % Estimate of variance
            clust_info(indd, 4) = max(disti); % Max distance
            clust_info(indd, 5) = min(disti); % Min distance
            clust_info(indd, 6) = xx / got; % X coordinate of the cluster center of mass
            clust_info(indd, 7) = yy / got; % Y coordinate of the cluster center of mass
            clust_info(indd, 8) = zz / got; % Z coordinate of the cluster center of mass

            % Clear temporary variables for next iteration
            clear dist_matrix disti cluster_coords
        end
    end

    % --------- Preparing the final results and save them in the Excel files ---------
    %%% the labels of cluster members are saved in a two column matrix the
    %%% second column contains the label of the first member of the cluster and
    %%% it repeats its value until when the same cluster is under consideration.
    %%% The first column displays the label of the other volumes belonging to
    %%% the clusters
    a=0;
    for i=1:siz_point(1)
        if sum(cluster_record(:,i))>=1
            for j=1:siz_point(1)

                if cluster_record(j,i)>=1
                    a=a+1;
                    position(a,1)=cluster_record(j,i);
                    position(a,2)=i;
                end
            end
        end
    end
    position2= position-1; % since the labelling in Imaris starts from number 0 and not from one in order to find the same cluster in Imaris to the
    % to the label values must be substructed 1
    %
    %%%%%%%% printing in the output file the data  quantified till now
    %%%%%%
    names={'number of particles', 'mean distance', 'variance of distance', 'max distance', 'min distance', 'center_x', 'center_y', 'center_z' };

    names2={'Mean_radius_nm', 'Variance_radius_n ', 'Total_resolved_spots' };
    values=[Mean_radius_nm , Variance_radius_n  total_resolved_spots];
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],names2,1,'A1')
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],values,1,'A2')
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],clust_info,2,'A2')
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],names,2,'A1')
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],position2,3,'A1')
    % Write 'names2' and 'values' to the first sheet
    %DEBUG: writetable(table(values), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 1, 'WriteRowNames', false, 'WriteVariableNames', false);
    %DEBUG: writetable(table(names2), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 1, 'WriteRowNames', false);
    % Create a table where the first row is the header and the second is the data
    output_table = [names2; num2cell(values)];
    disp(output_table);
    % Write the table to the Excel file, starting at cell A1
    writetable(cell2table(output_table), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 1, 'WriteRowNames', false, 'WriteVariableNames', false);
    % % Saving as an older .xls format (Excel 97-2003 format)
    % writetable(cell2table(output_table), [num2str(clast_par) '_output_' filename '.xls'], 'Sheet', 1, 'WriteRowNames', false, 'WriteVariableNames', false);
    % Save as CSV file
    writetable(cell2table(output_table), [num2str(clast_par) '_output_' filename '_Sheet1.csv'], 'WriteRowNames', false, 'WriteVariableNames', false);

    % Convert 'clust_info' into a table if it's not one already and write to the second sheet
    clust_info_table = array2table(clust_info, 'VariableNames', names);  % Convert 'clust_info' to table with proper column names
    writetable(clust_info_table, [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 2, 'WriteRowNames', false);
    % Assuming position2 is a matrix that needs to be written to the third sheet
    writetable(array2table(position2), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 3, 'WriteRowNames', false, 'WriteVariableNames', false);

    %%%% quantification of the distance between different clusters
    %%%%% the distance between clusters is defined to be the distance between their centre of
    %%%%% mass
    dist_clust(1:sit(2),1:sit(2))=0;
    cc1=0;
    for i=1:sit(2)
        for j=1:sit(2)
            if i==j
                dist_clust(i,j)=100000;% the fact that the distance of one cluster respect itself is zero would make more complicate
                %to determine the minimum distance between a cluster and it neighbours
            else

                cc1=cc1+1;
                dist_clust(i,j)=sqrt( (clust_info(i,6)-clust_info(j,6))^2 + (clust_info(i,7)-clust_info(j,7))^2 + (clust_info(i,8)-clust_info(j,8))^2 );

                dist_clust2(cc1)=dist_clust(i,j);
            end
        end
    end

    mindist_cluster=min(dist_clust);% the minimum distance between one cluster and the other ones
    mean_dist_cluster=mean(mindist_cluster);% the mean distance between closer clusters
    %%% exporting the latest information
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'], dist_clust,4,'A1')
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],mean_dist_cluster,5,'A1')
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],mindist_cluster',5,'B1')
    % names3={'number_of_cluster', 'total_number_of_spot_in_cluster ', 'average_number_of_spot_in_cluster', 'dispersion_number_of_spot_in_cluster','frequent_number_of_spot_in_cluster' }
    % values3=[number_of_cluster total_number_of_spot_in_cluster average_number_of_spot_in_cluster dispersion_number_of_spot_in_cluster frequent_number_of_spot_in_cluster];
    % values3=values3';
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],names3,6,'A1')
    % xlswrite([num2str(clast_par) '_output_' filename '.xlsx'],values3',6,'A2')

    % Exporting the latest information
    writetable(table(dist_clust), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 4, 'WriteRowNames', true);
    writetable(table(mean_dist_cluster), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 5, 'WriteRowNames', true);
    writetable(table(mindist_cluster'), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 5, 'WriteRowNames', true, 'WriteVariableNames', false);

    % Save other information about clusters
    names3 = {'number_of_cluster', 'total_number_of_spot_in_cluster', 'average_number_of_spot_in_cluster', ...
        'dispersion_number_of_spot_in_cluster', 'frequent_number_of_spot_in_cluster'};
    values3 = [number_of_cluster, total_number_of_spot_in_cluster, average_number_of_spot_in_cluster, ...
        dispersion_number_of_spot_in_cluster, frequent_number_of_spot_in_cluster];

    %DEBUG: writetable(table(values3), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 6, 'WriteRowNames', true, 'WriteVariableNames', false);
    %DEBUG: writetable(table(names3), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 6, 'WriteRowNames', true);
    % Create a table where the first row is the header and the second row is the data
    output_table3 = [names3; num2cell(values3)];
    disp(output_table3);
    % Write the combined table to the Excel file, starting from cell A1 of Sheet 6
    writetable(cell2table(output_table3), [num2str(clast_par) '_output_' filename '.xlsx'], 'Sheet', 6, 'WriteRowNames', false, 'WriteVariableNames', false);
    % % Saving as an older .xls format (Excel 97-2003 format)
    % writetable(cell2table(output_table3), [num2str(clast_par) '_output_' filename '.xls'], 'Sheet', 6, 'WriteRowNames', false, 'WriteVariableNames', false);
    % Save as CSV file
    writetable(cell2table(output_table3), [num2str(clast_par) '_output_' filename '_Sheet6.csv'], 'WriteRowNames', false, 'WriteVariableNames', false);

AGR-Kühler defekt – Symptome, Reparatur & Kosten (AGR散热器故障 – 症状、维修与费用)

https://www.motointegrator.de/blog/agr-kuehler-defekt-symptome-reparatur-kosten/

Pos. ATU-Nr. Dienstleistung / Teilebezeichnung Menge Betrag EUR Brutto

1 TEXT00 KV für AGR-Ventil erneuern 2 1EIN60 AGR-KÜHLER – AUS- UND EINBAUEN 1

3 FREMD0 AGR Kühler-Ventil (WM: 301.14.54) 1

4 NO6931 NORAUTO KUEHLERSCH. READYMI G12 EVO 5L 2

Werkstatt Alborz KfZ-Meisterbetrieb (https://alborz.gmbh/)

Hr. Mostafa

Kelloggstr. 8, 22045 HH

040-6570539 (0179-68372)

AGR散热器(也称为废气散热器)属于废气再循环(AGR)系统的一部分。它们的作用是将废气流中的一部分在返回进气道之前降温。为此,AGR散热器集成在冷却水循环系统中。故障的AGR散热器可能会导致高温负荷、废气残留物和因老化造成的磨损。其症状非常明显。一旦出现这些症状,应立即更换AGR散热器。

目录

* 10种常见的AGR散热器故障症状
* AGR散热器维修与费用
* 常见问题解答
   - AGR散热器的作用是什么?
   - 故障的AGR散热器能继续使用吗?
   - 如果AGR散热器坏了,会发生什么?

10种常见的废气再循环(AGR)散热器故障症状

一个故障的AGR散热器可能会表现出多种症状,这些症状不仅会影响车辆的性能,还可能导致严重的故障。准确了解AGR散热器故障的迹象非常重要,以便及时发现和修复可能的损坏。以下是一些可能表明AGR散热器故障的常见症状:

* 涡轮增压器的增压压力下降
* 冷却液流失,伴随发动机温度升高
* 废气压力无法控制地泄漏
* 在发动机运转时,冷却水缓慢流失
* 发动机重新启动后,可能会对气缸、活塞和连杆造成损害
* 车内有废气味
* 发动机关闭后,冷却系统的压力下降
* AGR散热器上可见泄漏,尤其是在废气侧
* 发动机区域温度异常
* 发动机舱内有明显的蒸汽或烟雾

一个故障的AGR散热器通过明显的症状表现出来。一个明显的故障迹象是冷却水的缓慢流失,伴随着更高的发动机温度。通常,这种流失最初是不可察觉的,因为在发动机运转时,废气背压高于冷却系统的压力。一旦发动机关闭,压力下降,冷却液便会进入发动机。重新启动发动机时,可能会对气缸、活塞和连杆造成损害。另一个症状是在车内出现废气味。

AGR散热器维修和费用

当出现AGR散热器故障症状时,应立即更换散热器。否则,可能会造成严重的损坏,甚至导致发动机严重损坏。尤其是当冷却水流失严重时,可能会导致发动机过热,从而损坏气缸盖垫片和气缸盖。

AGR散热器故障的维修费用根据车辆型号差异较大。通常,需要拆卸多个部件,如废气管、油回流管或柴油颗粒过滤器,以便接触到散热器。一旦散热器暴露出来,就可以进行更换。建议不要使用二手AGR散热器。

如果废气散热器故障,AGR阀通常也需要更换,因为它们通常是一个整体部件。如果这些部件是分开安装的,或者只是AGR阀损坏,则不需要更换两个部件。

一个新的AGR散热器原厂价格通常在750欧元到1500欧元之间。此外,还需要支付新冷却液的材料费用——大约三到五升。每升冷却液大约8到10欧元,总费用大约为25欧元到50欧元。更换散热器的工作时间也需要计算在内。此过程中还包括AGR阀的重新学习。整体工作时间大约为四到六小时。根据工时费,工作费用大约为400欧元到600欧元。最终费用根据车辆型号不同而有所不同。总体而言,预计费用大约为1100欧元到2200欧元。

通过在Motointegrator的在线商店购买新的AGR散热器和相关组件,您可以节省费用。我们的商品包括约500万个汽车配件和配件,来自顶级品牌和优质制造商,价格优惠。欢迎您亲自来看看并在我们的在线商店中选购。

常见问题解答

AGR散热器的作用是什么?

AGR散热器,也称为废气散热器,是废气再循环系统(AGR)的一个重要组成部分,旨在在废气被重新引导进入进气管之前对部分废气流进行冷却。它集成在冷却水循环系统中。一个故障的AGR散热器可能会导致过高的热负荷、废气残留物和因老化造成的磨损。这些明显的症状表明,必须立即更换AGR散热器。

故障的AGR散热器可以继续使用吗?

通常情况下,如果AGR阀故障,短期内不会产生重大影响。然而,如果发生严重故障,导致发动机进入紧急模式,则应立即前往修理厂。

如果AGR散热器坏了,会发生什么?

一个故障的AGR散热器可能会导致发动机过热、废气残留物堆积以及加速老化和磨损。

AGR-Kühler (auch Abgaskühler genannt) gehören zur Abgasrückführung (AGR). Sie sorgen dafür, dass die Temperaturen eines Teils des Abgasstroms vor der Rückführung in den Ansaugtrakt heruntergekühlt werden. Dafür ist der AGR-Kühler in den Kühlwasserkreislauf integriert. Ein defekter AGR-Kühler kann zu hohen thermischen Belastungen, Abgasrückständen und altersbedingtem Verschleiß führen. Die Symptome sind eindeutig. Treten diese auf, sollte der AGR-Kühler umgehend ausgetauscht werden.

Inhaltsverzeichnis

* 10 häufige Symptome eines defekten AGR-Kühlers
* AGR-Kühler reparieren und Kosten
* FAQs
   - Was macht ein AGR-Kühler?
   - Kann man mit defektem AGR-Kühler fahren?
   - Was passiert, wenn der AGR-Kühler kaputt ist?

AGR-Kühler defekt – Symptome, Reparatur & Kosten

AGR-Kühler (auch Abgaskühler genannt) gehören zur Abgasrückführung (AGR). Sie sorgen dafür, dass die Temperaturen eines Teils des Abgasstroms vor der Rückführung in den Ansaugtrakt heruntergekühlt werden. Dafür ist der AGR-Kühler in den Kühlwasserkreislauf integriert. Ein defekter AGR-Kühler kann zu hohen thermischen Belastungen, Abgasrückständen und altersbedingtem Verschleiß führen. Die Symptome sind eindeutig. Treten diese auf, sollte der AGR-Kühler umgehend ausgetauscht werden.

10 häufige Symptome eines defekten AGR-Kühlers

Ein fehlerhafter AGR-Kühler kann sich durch eine Vielzahl von Symptomen bemerkbar machen, die nicht nur die Leistung des Fahrzeugs beeinträchtigen, sondern auch zu ernsthaften Problemen führen können. Ein genaues Verständnis der Anzeichen eines AGR-Kühler-Defekts ist von entscheidender Bedeutung, um mögliche Schäden zu erkennen und zu beheben. Hier sind einige häufige Symptome, die auf einen defekten AGR-Kühler hinweisen können:

* Abfallender Ladedruck des Turboladers
* Verlust von Kühlflüssigkeit in Verbindung mit erhöhter Motortemperatur
* Unkontrolliertes Entweichen des Abgasdrucks
* Schleichender Verlust von Kühlwasser beim laufenden Motor
* Mögliche Schäden an Zylinder, Kolben und Pleuel nach erneutem Start des Motors
* Abgasgeruch im Fahrzeuginnenraum
* Druckabfall im Kühlsystem nach Abstellen des Motors
* Sichtbare Undichtigkeiten am AGR-Kühler, besonders auf der Abgasseite
* Temperaturanomalien im Motorbereich
* Auffällige Dampf- oder Rauchentwicklung im Motorraum

Ein defekter AGR-Kühler zeigt sich durch eindeutige Symptome. Ein sicheres Anzeichen für einen Defekt am Abgaskühler ist ein schleichender Verlust des Kühlwassers in Verbindung mit einer höheren Motortemperatur. Meist bleibt dieser Verlust zunächst unentdeckt, weil bei einem laufenden Motor der Abgasgegendruck höher als der Druck im Kühlsystem ist. Sobald der Motor abgestellt wird, sinkt der Druck und die Kühlflüssigkeit gelangt in den Motor. Wird der Motor wieder gestartet, kann es zu Schäden an Zylinder, Kolben und Pleuel kommen. Ein weiteres Symptom ist Abgasgeruch im Fahrzeuginnenraum. AGR-Kühler reparieren und Kosten

Treten Symptome eines defekten AGR-Kühlers auf, sollte umgehend der Kühler ausgetauscht werden. Anderenfalls besteht die Gefahr von Schäden bis hin zu einem kapitalen Motorschaden. Das gilt vor allem dann, wenn der Kühlwasserverlust so stark ist, dass der Motor überhitzt. Dann kann es zu Schäden an der Zylinderkopfdichtung und dem Zylinderkopf kommen.

Die Kosten für einen defekten AGR-Kühler können je nach Fahrzeugmodell stark variieren. Meist müssen verschiedene Komponenten wie das Abgasrohr, die Ölrücklaufleitung oder der Dieselpartikelfilter demontiert werden, um an den Kühler zu gelangen. Sobald der Kühler freigelegt ist, kann er ersetzt werden. Die Verwendung von gebrauchten AGR-Kühlern empfiehlt sich nicht.

Bei einem defekten Abgaskühler wird auch das AGR-Ventil ersetzt, da es sich hier oft um ein Bauteil handelt. Anders sieht das aus, wenn die Komponenten getrennt verbaut wurden oder lediglich das AGR-Ventil beschädigt ist. In diesem Fall müssen nicht beide Bauteile gewechselt werden.

Ein neuer AGR-Kühler kostet im Original zwischen 750 Euro und 1.500 Euro. Dazu kommen die Materialkosten für das neue Kühlwasser – ca. drei bis fünf Liter. Zusammengerechnet kommen hierfür pro Liter etwa acht bis zehn Euro zusammen. In Summe ergeben sich ca. 25 Euro bis 50 Euro. Der Arbeitsaufwand für den Wechsel des Kühlers muss ebenfalls dazugerechnet werden. Hier fließt auch der Anlernvorgang des AGR-Ventils rein. Insgesamt müssen für den Aufwand vier bis sechs Stunden veranschlagt werden. Je nach Stundensatz kann es zu Arbeitskosten von ca. 400 Euro bis 600 Euro kommen. Die finalen Kosten variieren je nach Fahrzeugmodell. Insgesamt kann mit Kosten von rund 1.100 Euro und 2.200 Euro gerechnet werden.

Sie können Kosten sparen, indem Sie den neuen AGR-Kühler und dessen Komponenten günstig im Online-Shop von Motointegrator kaufen. In unserem Sortiment mit ca. 5 Millionen Kfz-Ersatzteilen und Zubehör finden Sie eine große Auswahl an Kfz-Ersatzteilen von Top-Marken und Premium-Herstellern zu einem günstigen Preis. Überzeugen Sie sich selbst davon und stöbern Sie in unserem Online-Shop. FAQs Was macht ein AGR-Kühler?

Der AGR-Kühler, auch als Abgaskühler bekannt, ist ein wesentlicher Bestandteil der Abgasrückführung (AGR), der dazu dient, einen Teil des Abgasstroms vor seiner Rückführung in den Ansaugtrakt zu kühlen. Er ist in den Kühlwasserkreislauf integriert. Bei einem defekten AGR-Kühler können erhöhte thermische Belastungen, Rückstände im Abgas und altersbedingter Verschleiß auftreten. Diese klar erkennbaren Symptome weisen auf Probleme hin, die einen sofortigen Austausch des AGR-Kühlers erfordern. Kann man mit defektem AGR-Kühler fahren?

Bei einem defekten AGR-Ventil sind normalerweise keine kurzfristigen Auswirkungen zu erwarten. Allerdings sollte bei einer schwerwiegenden Fehlfunktion, die das Notlaufprogramm des Motors aktiviert, sofort eine Werkstatt aufgesucht werden. Was passiert, wenn der AGR-Kühler kaputt ist?

Ein defekter AGR-Kühler kann zu einer Überhitzung des Motors führen, Rückstände im Abgas zurücklassen und altersbedingten Verschleiß begünstigen.

How to Fix NTFS Mount Errors and File System Corruption on Linux

jhuang@WS-2290C:/dev$ sudo mount /dev/sdg1 /media/jhuang/Elements $MFTMirr does not match $MFT (record 3). Failed to mount ‘/dev/sdg1’: Input/output error NTFS is either inconsistent, or there is a hardware fault, or it’s a SoftRAID/FakeRAID hardware. In the first case run chkdsk /f on Windows then reboot into Windows twice. The usage of the /f parameter is very important! If the device is a SoftRAID/FakeRAID then first activate it and mount a different device under the /dev/mapper/ directory, (e.g. /dev/mapper/nvidia_eahaabcc1). Please see the ‘dmraid’ documentation for more details.

The error indicates an issue with the NTFS file system on /dev/sdg1. Specifically, it points to a corruption in the $MFT (Master File Table) or a potential hardware issue. This is a common problem when NTFS file systems are not properly shut down or there are issues during file system operations. Solutions to try:

  1. Run chkdsk on Windows

    Since the error specifically mentions running chkdsk /f on Windows, you can try the following:

    • Unmount the drive if it’s currently mounted:

      sudo umount /dev/sdg1
    • Use a Windows machine (or dual-boot with Windows) to check the NTFS file system.

    • Once you connect the drive to a Windows machine, open a Command Prompt as Administrator and run:

      chkdsk /f X:

      Replace X: with the actual drive letter corresponding to your NTFS partition.

    • After chkdsk finishes, restart Windows twice as suggested in the error message to ensure the system fully clears any issues.

    • Reconnect the drive to your Linux machine and try mounting it again:

      sudo mount /dev/sdg1 /media/jhuang/Elements
  2. Check for Hardware Issues

    If chkdsk does not resolve the issue, there could be a problem with the hardware itself:

    • Try a different USB port (if it’s a USB drive) or different computer to rule out issues with your current setup.
    • If the drive shows signs of failure (slow access, clicking sounds, etc.), it could be a hardware issue.
  3. Use ntfsfix on Linux

    If you do not have access to a Windows machine, you can try using ntfsfix, which is a tool available on Linux that can sometimes resolve minor NTFS inconsistencies. It’s not as powerful as chkdsk but might help in some cases.

    • Install ntfs-3g (if not already installed):

      sudo apt-get install ntfs-3g
    • Run ntfsfix:

      sudo ntfsfix /dev/sdg1
    • Try mounting the partition again:

      sudo mount /dev/sdg1 /media/jhuang/Elements
  4. Check for RAID/FakeRAID Configuration

    If you are using RAID or FakeRAID (software RAID) on the device, this can cause issues with mounting the NTFS partition. The error message mentions the possibility of a FakeRAID, so you may need to use dmraid or ensure that the RAID is properly set up.

    • Check RAID status:

      sudo dmraid -s
    • If a RAID setup is detected, you might need to activate the RAID configuration and mount the device using /dev/mapper/, not directly via /dev/sdg1.

    • If you are unsure, refer to the documentation of your RAID setup for how to correctly mount the device.

  5. Backup Data (if possible)

    If the partition is still inaccessible after these steps, you may want to try data recovery tools like TestDisk or PhotoRec (both of which can handle NTFS partitions) to recover any important files before you attempt to format the drive.

  6. Summary:

    • Run chkdsk /f on Windows to repair the NTFS file system.
    • If the drive has hardware issues, consider using a different port or machine.
    • Try using ntfsfix on Linux as a last resort.
    • If using RAID, verify and manage the RAID setup.

What is rlog in DESeq2?

The regularized log transformation (rlog) in DESeq2 is a function designed to stabilize the variance in RNA-seq count data, making it more suitable for downstream statistical analyses. This transformation is particularly useful for reducing the heteroscedasticity (variance increases with the mean) commonly seen in RNA-seq data.

The function used in DESeq2 to apply the regularized log transformation is rlog(). It performs the transformation by modeling the counts as a function of size factors (accounting for library size differences) and then applying a regularization technique to smooth out the variance across genes, which helps with downstream visualizations and clustering.

Key points:

  • Purpose: The main goal of rlog is to stabilize variance across genes and make RNA-seq data more comparable, which helps to visualize relationships between samples and identify patterns of gene expression in techniques like principal component analysis (PCA) and clustering.

  • Variance Stabilization: Raw RNA-seq counts often show greater variance at higher expression levels. The rlog transformation reduces this effect, making low and high-expression genes more comparable in terms of variance.

  • Function Call:

      rlog_counts <- rlog(dds, blind = FALSE)
    
        #dds: A DESeqDataSet object containing your RNA-seq count data.
        #blind: If TRUE, the transformation ignores the experimental design (useful for exploratory analysis). If FALSE, it incorporates the experimental design into the transformation (recommended for differential expression analysis).

Example Workflow in DESeq2:

    library(DESeq2)

    # Assuming dds is a DESeqDataSet object created from raw count data
    dds <- DESeq(dds)

    # Apply the rlog transformation
    rlog_counts <- rlog(dds, blind = FALSE)

    # Use rlog-transformed data for PCA
    plotPCA(rlog_counts)

Conclusion:

The regularized log transformation is an important step in RNA-seq data analysis when you want to visualize the relationships between samples or perform clustering, as it stabilizes variance and removes the mean-dependent variability present in raw count data.

Virus Genome Analysis Pipeline: Hybrid Capture, DAMIAN Blastn, and VRAP Mapping for Measles (麻疹) Sample

Measles_S1_on_OR854811_reads_coverage

  1. DAMIAN calculations

    cd /mnt/nvme0n1p1/blast
    
    # -- Measles --
    #I think it would be good to first analyse the sample in DAMIAN Blastn and then map it to the closest related genome resulting from DAMIAN to see if we cover the entire sequence.
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_measles/p20578/N24_539_1A_measles_S1_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_measles/p20578/N24_539_1A_measles_S1_R2_001.fastq.gz --sample N24_539_1A_measles_S1_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 40 --force
    damian_report.rb
    
    #Please find enclosed two links to further samples that require analysis with DAMIAN. In each case, the host is a human. The samples comprise three liver/serum samples (upper link below)  and 12 samples from the ring trial (lower link below).
    #This is merely an initial evaluation of samples with DAMIAN, using Blastn.
    
    # -- Leber/Serum --
    #Bei den Lexogen CORALL Libraries muss bei der Analyse darauf geachtet werden, dass diese in Read 1 standardmäßig einen UMI (12 nt) haben und die ersten 9 nt von Read 2 haben eine erhöhte Fehlerrate beim mapping (durch das Priming-Prinzip laut Hersteller) – die könnte man vor der weiteren Analyse trimmen.
    
    #cutadapt -u 12 -U 9 -o output_R1_trimmed.fastq -p output_R2_trimmed.fastq input_R1.fastq input_R2.fastq
    #
    #Explanation:
    #
    #    -u 12: Trim 12 bases from the start of Read 1 (this is for the UMI).
    #    -U 9: Trim 9 bases from the start of Read 2 (this is to deal with the higher error rate in the first 9 bases of Read 2).
    #    -o output_R1_trimmed.fastq: Output file for Read 1 after trimming.
    #    -p output_R2_trimmed.fastq: Output file for Read 2 after trimming.
    #
    #Why this works:
    #
    #    The -u option applies trimming to Read 1, while the -U option applies trimming to Read 2. This ensures that the correct bases are trimmed from each read, without affecting the other.
    #
    #Now when you run this command, Read 1 will have 12 bases trimmed, and Read 2 will have 9 bases trimmed, as expected.
    
    cutadapt -u 12 -U -9 -o Leber1_Corall_wo_Dnase_S5_R1_001.fastq -p Leber1_Corall_wo_Dnase_S5_R2_001.fastq p20575_L3/Leber1_Corall_wo_Dnase_S5_R1_001.fastq.gz p20575_L3/Leber1_Corall_wo_Dnase_S5_R2_001.fastq.gz
    cutadapt -u 12 -U -9 -o Leber2_Corall_wo_Dnase_S6_R1_001.fastq -p Leber2_Corall_wo_Dnase_S6_R2_001.fastq ./p20576_L3/Leber2_Corall_wo_Dnase_S6_R1_001.fastq.gz ./p20576_L3/Leber2_Corall_wo_Dnase_S6_R2_001.fastq.gz
    cutadapt -u 12 -U -9 -o Serum_Corall_wo_Dnase_S7_R1_001.fastq -p Serum_Corall_wo_Dnase_S7_R2_001.fastq ./p20577_L3/Serum_Corall_wo_Dnase_S7_R1_001.fastq.gz ./p20577_L3/Serum_Corall_wo_Dnase_S7_R2_001.fastq.gz
    
    for sample in Leber1_Corall_wo_Dnase_S5 Leber2_Corall_wo_Dnase_S6 Serum_Corall_wo_Dnase_S7; do
    java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 16 ${sample}_R1_001.fastq ${sample}_R2_001.fastq trimmed/${sample}_R1.fastq.gz trimmed/${sample}_unpaired_R1.fastq.gz trimmed/${sample}_R2.fastq.gz trimmed/${sample}_unpaired_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20;
    done
    
    fastqc ./trimmed/Leber1_Corall_wo_Dnase_S5_R1.fastq.gz
    fastqc ./trimmed/Leber1_Corall_wo_Dnase_S5_R2.fastq.gz
    
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Liver/trimmed/Leber1_Corall_wo_Dnase_S5_R1.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Liver/trimmed/Leber1_Corall_wo_Dnase_S5_R2.fastq.gz --sample Leber1_Corall_wo_Dnase_S5_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 60 --force
    damian_report.rb -a 37 -n Leber1_Corall_wo_Dnase_S5_blastn
    
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Liver/trimmed/Leber2_Corall_wo_Dnase_S6_R1.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Liver/trimmed/Leber2_Corall_wo_Dnase_S6_R2.fastq.gz --sample Leber2_Corall_wo_Dnase_S6_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 60 --force
    damian_report.rb -a 39 -n Leber2_Corall_wo_Dnase_S6_blastn
    
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Liver/trimmed/Serum_Corall_wo_Dnase_S7_R1.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Liver/trimmed/Serum_Corall_wo_Dnase_S7_R2.fastq.gz --sample Serum_Corall_wo_Dnase_S7_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 60 --force
    damian_report.rb -a 41 -n Serum_Corall_wo_Dnase_S7_blastn
    
    zip -r Serum_Corall_wo_Dnase_S7_blastn.zip Serum_Corall_wo_Dnase_S7_blastn
    
    # -- Ringversuch --
    #Hier waren die DNA-Libraries der Samples ungewöhnlich niedrig konzentriert (das Kit ist normalerweise sehr robust für niedrigen Input, von ChIP-Proben auch z.B.) und die Read-Anzahl ist auch niedriger als angepeilt. Vielleicht klärt die bioinformatische Analyse das ja etwas auf – bzw. wie viel DNA würdet ihr aus dem Ausgangsmaterial erwarten (waren es Zell-Spikeins in der RNA?).. Die RNA-Libraries aus den gleichen Proben sehen jeweils ok aus, Readanzahl wurde in etwa erreicht, die Duplikationsrate ist relativ hoch, was bei low input aber auch normal sein kann.
    
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20585/07_RV1_RNA_S7_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20585/07_RV1_RNA_S7_R2_001.fastq.gz --sample 07_RV1_RNA_S7_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 100 --force
    damian_report.rb
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20586/08_RV2_RNA_S8_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20586/08_RV2_RNA_S8_R2_001.fastq.gz --sample 08_RV2_RNA_S8_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 100 --force
    damian_report.rb
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20587/09_RV3_RNA_S9_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20587/09_RV3_RNA_S9_R2_001.fastq.gz --sample 09_RV3_RNA_S9_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 100 --force
    damian_report.rb
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20588/10_RV4_RNA_S10_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20588/10_RV4_RNA_S10_R2_001.fastq.gz --sample 10_RV4_RNA_S10_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 100 --force
    damian_report.rb
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20589/11_RV5_RNA_S11_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20589/11_RV5_RNA_S11_R2_001.fastq.gz --sample 11_RV5_RNA_S11_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 100 --force
    damian_report.rb
    damian.rb --host human3 --type rna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20590/12_RV6_RNA_S12_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20590/12_RV6_RNA_S12_R2_001.fastq.gz --sample 12_RV6_RNA_S12_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 100 --force
    damian_report.rb
    #END
    
    cd jhuang@hamm:/mnt/h1/jhuang/blast
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20579/01_RV1_DNA_S1_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20579/01_RV1_DNA_S1_R2_001.fastq.gz --sample 01_RV1_DNA_S1_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 55 --force
    damian_report.rb
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20580/02_RV2_DNA_S2_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20580/02_RV2_DNA_S2_R2_001.fastq.gz --sample 02_RV2_DNA_S2_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 55 --force
    damian_report.rb
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20581/03_RV3_DNA_S3_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20581/03_RV3_DNA_S3_R2_001.fastq.gz --sample 03_RV3_DNA_S3_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 55 --force
    damian_report.rb
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20582/04_RV4_DNA_S4_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20582/04_RV4_DNA_S4_R2_001.fastq.gz --sample 04_RV4_DNA_S4_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 55 --force
    damian_report.rb
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20583/05_RV5_DNA_S5_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20583/05_RV5_DNA_S5_R2_001.fastq.gz --sample 05_RV5_DNA_S5_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 55 --force
    damian_report.rb
    damian.rb --host human3 --type dna -1 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20584/06_RV6_DNA_S6_R1_001.fastq.gz -2 /home/jhuang/DATA/Data_Damian/241213_VH00358_120_AAG523FM5_Ringversuch/p20584/06_RV6_DNA_S6_R2_001.fastq.gz --sample 06_RV6_DNA_S6_blastn --blastn progressive --blastp never --min_contiglength 500 --threads 55 --force
    damian_report.rb
    #END
    
    #Sent: DNA_S1, S2, S3, RNA_S7, S8, S9, S10.
    #TODOs: DNA_S4, S5, S6, RNA_S11, S12.
  2. Trimming

    mkdir trimmed
    for sample in N24_539_1A_measles_S1; do
    java -jar /home/jhuang/Tools/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 16 ${sample}_R1_001.fastq.gz ${sample}_R2_001.fastq.gz trimmed/${sample}_R1.fastq.gz trimmed/${sample}_unpaired_R1.fastq.gz trimmed/${sample}_R2.fastq.gz trimmed/${sample}_unpaired_R2.fastq.gz ILLUMINACLIP:/home/jhuang/Tools/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 AVGQUAL:20;
    done
  3. Download all virus genomes (18644)

    mv datasets /usr/local/bin/
    chmod +x /usr/local/bin/datasets
    #datasets download virus genome --complete-only --assembly-source refseq under ~/REFs
    datasets download virus genome taxon "Viruses" --complete-only --refseq
    #To check for RefSeq data only, look for NC_, NM_, or similar prefixes in sequence headers and identifiers.
    wget -r -np -nH --cut-dirs=3 ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/
    [Option1] sudo apt install ncbi-entrez-direct
    [Option1] esearch -db nucleotide -query "txid3052345[Organism:exp]" | efetch -format fasta > virus_genome_3052345.fasta  # 23788
     ~/Scripts/filter_fasta.py virus_genome_3052345.fasta virus_complete_genome_3052345.fasta
        import sys
        from Bio import SeqIO
    
        # Get input and output file paths from command line arguments
        input_file = sys.argv[1]
        output_file = sys.argv[2]
    
        # Open the output file to write filtered sequences
        with open(output_file, "w") as out_handle:
            # Parse the multi-line FASTA file
            for record in SeqIO.parse(input_file, "fasta"):
                # Check if 'complete genome' is in the header (description)
                if "complete genome" in record.description:
                    # Write the entire record (header + multi-line sequence) to the output file
                    SeqIO.write(record, out_handle, "fasta")
    
        print(f"Complete genome sequences saved to {output_file}")
    [Option1] grep "complete genome" virus_complete_genome_3052345.fasta | wc -l  #917
    [Option2] https://www.ebi.ac.uk/ena/browser/view/3052345
    [Option2] grep "complete genome" ena_3052345_sequence.fasta | wc -l  #820
  4. The commends for more comprehensive blast annotation, in order to get the closest related genome.

    ln -s ~/Tools/vrap/ .
    conda activate /home/jhuang/miniconda3/envs/vrap
    vrap/vrap.py  -1 trimmed/N24_539_1A_measles_S1_R1.fastq.gz -2 trimmed/N24_539_1A_measles_S1_R2.fastq.gz  -o N24_539_1A_measles_S1 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/ncbi_dataset/data/genomic.fna --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g
    [Optional] vrap/vrap.py  -1 trimmed/N24_539_1A_measles_S1_R1.fastq.gz -2 trimmed/N24_539_1A_measles_S1_R2.fastq.gz  -o N24_539_1A_measles_S1_3052345 --bt2idx=/home/jhuang/REFs/genome  --host=/home/jhuang/REFs/genome.fa --virus=/home/jhuang/REFs/virus_complete_genome_3052345.fasta --nt=/mnt/nvme0n1p1/blast/nt --nr=/mnt/nvme0n1p1/blast/nr  -t 100 -l 200  -g
    # * 4 nt_dbs (--virus, --host, download_db.py(nucleotide, currently alphaherpesvirus), nt), 2 prot_db (download_db.py(protein, currently alphaherpesvirus), nr) for blast, save under ./blast/db/virus, ./blast/db/host, vrap/database/viral_db/viral_nucleotide, vrap/database/viral_db/viral_protein
    # * 1 bowtie_database for host removal (--host), save under ./bowtie/host.
    # * bowtie run before assembly
    # * blast run after assembly for the contigs, therefore it does not exist the taxfilt step in vrap.
    # * checking the order of the databases for annotation step, namely which database will be taken firstly for annotionn after setting --virus?
    # * If --host is for both bowtie and blastn.
    # * if only --bt2idx define, only bowtie, no blastn! == no ""--host=/home/jhuang/REFs/genome.fa"" still has the host-removal step!
  5. Using the bowtie of vrap to map the reads on ref_genome/reference.fasta (The reference refers to the closest related genome found from the list generated by vrap)

    vrap/vrap.py  -1 trimmed/N24_539_1A_measles_S1_R1.fastq.gz -2 trimmed/N24_539_1A_measles_S1_R2.fastq.gz  -o N24_539_1A_measles_S1_on_OR854811 --host /home/jhuang/Downloads/OR854811.fasta   -t 100 -l 200  -g
    cd bowtie
    mv mapped mapped.sam
    samtools view -S -b mapped.sam > mapped.bam
    samtools sort mapped.bam -o mapped_sorted.bam
    samtools index mapped_sorted.bam
    samtools view -H mapped_sorted.bam
    samtools flagstat mapped_sorted.bam
    #5755885 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    0 + 0 duplicates
    4457643 + 0 mapped (77.44% : N/A)
    1300648 + 0 paired in sequencing
    650324 + 0 read1
    650324 + 0 read2
    637608 + 0 properly paired (49.02% : N/A)
    901044 + 0 with itself and mate mapped
    77271 + 0 singletons (5.94% : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
    #4457643 of 5755885 (77.44%) can map the reference.
    bamCoverage -b mapped_sorted.bam -o ../../Measles_S1_on_OR854811_reads_coverage.bw
  6. Show the bw on IGV