2400 Mycobacterium phage genomes annotated, only 700 were successful #39

qq9236247 · 2024-12-23T11:52:29Z

I downloaded 2400 mycobacteriophage genomes from https://phagesdb.org/ and annotated them in bulk, but only a little over 700 were annotated successfully. I don't know where the problem lies.
Taking the genome linked below as an example, annotation fails whether I run it individually or in bulk.
https://phagesdb.org/media/fastas/Abinghost.fasta

npbhavya · 2024-12-23T15:09:11Z

I can run sphae annotate on the provided fasta file without any errors. Can you confirm that you are running the latest version of sphae, which is v1.4.5? If you are, can you send me the command you ran and the sphae.log generated in the output directory?

Finally, the other suggestion is to download and run the sphae docker instance, which would ensure the correct versions are run.

qq9236247 · 2024-12-26T03:53:30Z

Thank you for your reply. I retested and it succeeded this time. I also retested two earlier cases that had failed, and they succeeded as well. I plan to re-annotate the 2400 genomes and will report back if there are any issues. Thank you.

npbhavya · 2024-12-30T02:21:12Z

That's great; let me know if you run into any other errors.

qq9236247 · 2025-01-01T13:54:45Z

I ran a batch analysis of 2,492 phage genomes. When the program finished, it reported some errors. On checking, I found 2,490 results, so 2 are missing. Because of the large number of genomes, I have not yet identified which two did not complete.

npbhavya · 2025-01-02T09:30:30Z

Hi,

I have added a script to this repo: https://github.com/linsalrob/sphae/blob/main/misc/merging_output.py.

To run it, copy the summary.txt files to a new directory and run the script on that directory, for example: python misc/merging_output.py <directory with summary files> <output file name>

In the output, I look at the columns "Failed during assembly" and "No contigs assigned viral". If either column lists "Yes", the sample failed, either because the assembly did not generate any contigs, or because the contigs it generated were too short or none of them were assigned as viral.
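As a rough sketch of that check, the snippet below scans a merged table for "Yes" in either of the two columns. It assumes the merged output is tab-separated with a header row and a sample-name column called "Sample"; the file layout and the "Sample" column name are assumptions for illustration and are not taken from merging_output.py.

```python
import csv
import io

# Column names quoted in the thread; a "Yes" in either marks a failed sample.
FAIL_COLUMNS = ("Failed during assembly", "No contigs assigned viral")


def failed_samples(handle, sample_column="Sample"):
    """Map each failed sample name to the list of columns that say "Yes".

    `handle` is any open text file; `sample_column` is a hypothetical
    name for the column holding the sample identifier.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    failed = {}
    for row in reader:
        reasons = [
            column
            for column in FAIL_COLUMNS
            if (row.get(column) or "").strip().lower() == "yes"
        ]
        if reasons:
            failed[row[sample_column]] = reasons
    return failed


if __name__ == "__main__":
    # Tiny in-memory example standing in for a merged summary table.
    example = io.StringIO(
        "Sample\tFailed during assembly\tNo contigs assigned viral\n"
        "phageA\tNo\tNo\n"
        "phageB\tYes\tNo\n"
    )
    for name, reasons in failed_samples(example).items():
        print(name, "->", ", ".join(reasons))
```

For a real file, open it with `open(path, newline="")` and pass the handle to `failed_samples`.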

Hope this helps, let me know if you run into any errors with this script or if you need the script fixed to add any other features.

qq9236247 · 2025-01-02T14:07:03Z

Attachment: wxd.zip <https://drive.google.com/file/d/1wMbz_tlmyXijNW5eYsooPe-3F8kN9Fz-/view?usp=drive_web>

Thank you very much. I wrote my own script to extract information from summary.txt, but that approach only works for genomes that were annotated successfully. Genomes that fail annotation do not produce a summary.txt file, so failed annotations cannot be identified this way.

My script also extracts the positions of specific genes from summary.functions (several genes can be extracted at once). These two features might be quite useful, and I recommend packaging them into the software. I wrote the second feature to extract repressor genes, to help determine whether a phage is lysogenic, which is important when deciding whether it can be used in clinical treatment. I am not sure whether there is a better, more direct way to help users make this judgment.

Note: my code includes a function that extracts classifications from genome names, which is probably only useful to me.

The attachment contains the source code, the summary extraction results for the 2,490 genomes I ran, and the commands I executed.

I also tested de novo assembly on 5 genomes (which I had previously assembled successfully with Unicycler), but all of them failed because the raw data was too large. I am not sure how to send them to you for testing. For now, I am only using the annotation feature.
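Since failed genomes leave no summary.txt behind, one way to find them is to compare the input FASTA names against the result directories. The sketch below assumes each genome's results sit in a subdirectory named after the FASTA file (without the extension) and that a successful run leaves a file whose name ends in "summary.txt" somewhere inside it. Both the layout and the function name `find_missing` are assumptions for illustration, not sphae's documented output structure.

```python
from pathlib import Path


def find_missing(fasta_dir, results_dir, summary_pattern="*summary.txt"):
    """Return sorted genome names that have a FASTA input but no summary file.

    Assumed (hypothetical) layout:
      fasta_dir/<name>.fasta
      results_dir/<name>/.../<something>summary.txt
    """
    fasta_dir = Path(fasta_dir)
    results_dir = Path(results_dir)
    missing = []
    for fasta in fasta_dir.glob("*.fasta"):
        sample_dir = results_dir / fasta.stem
        # A genome counts as done only if its directory exists and holds
        # at least one file matching the summary pattern.
        done = sample_dir.is_dir() and any(sample_dir.rglob(summary_pattern))
        if not done:
            missing.append(fasta.stem)
    return sorted(missing)


if __name__ == "__main__":
    # Hypothetical directory names; replace with the real input and output paths.
    for name in find_missing("genomes", "sphae.out"):
        print(name)
```

With 2,492 inputs and 2,490 results, a comparison like this would list the two genomes that did not complete, provided the directory naming matches the assumption above.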

npbhavya · 2025-01-07T23:28:41Z

That sounds like a useful script. Can you commit it to a miscellaneous directory in the repo? I will take a look and add it to the main repo accordingly.

Regarding lifestyle, we currently don't use repressor genes; we use the presence of an integrase as the indicator of a lysogenic lifestyle. Do you have a reference for using repressor genes to indicate this? It would be good to add that as well.

Regarding the 5 genomes that didn't assemble: since the raw data is very large, you can subsample the reads with a tool such as rasusa https://github.com/mbhall88/rasusa. I am also happy to help if you can send a link to the data to [email protected].

qq9236247 · 2025-01-12T09:37:11Z

I am an ICU doctor and not very familiar with GitHub, so in my last email I sent you the source files of the code I wrote.

I am also not very familiar with phage lysogeny, since I have only been studying phages for a year. This time, I analyzed 2,400 phage genomes from https://phagesdb.org/, and the results showed that 515 of them lack the *int* gene. However, 164 of these phages are classified as "Temperate" according to the cluster information on https://phagesdb.org/clusters/ (see the attachment for details). That is why I thought of extracting the *rep* gene for analysis. Even so, 13 phages lack both the *rep* gene and the *int* gene but are still classified as "Temperate." I plan to review the literature systematically soon to address this, and I would appreciate your help with it. Thank you.
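The cross-check described above can be sketched as a few filters over per-phage records. The `Phage` record, the keyword matching on "integrase" and "repressor", and the example data are all assumptions for illustration; this is not the code from the attachment, and it does not read summary.functions directly.

```python
from dataclasses import dataclass


@dataclass
class Phage:
    name: str
    functions: list[str]  # predicted gene functions (hypothetical input)
    lifestyle: str        # lifestyle label from the cluster table


def has_function(phage, keyword):
    """True if any predicted function contains the keyword (case-insensitive)."""
    return any(keyword in function.lower() for function in phage.functions)


def cross_check(phages):
    """Group phage names by the three categories discussed in the thread."""
    no_int = [p for p in phages if not has_function(p, "integrase")]
    no_int_temperate = [p for p in no_int if p.lifestyle.lower() == "temperate"]
    no_int_no_rep_temperate = [
        p for p in no_int_temperate if not has_function(p, "repressor")
    ]
    return {
        "no_int": [p.name for p in no_int],
        "no_int_temperate": [p.name for p in no_int_temperate],
        "no_int_no_rep_temperate": [p.name for p in no_int_no_rep_temperate],
    }


if __name__ == "__main__":
    # Made-up records, only to show the shape of the input.
    phages = [
        Phage("A", ["tyrosine integrase", "immunity repressor"], "Temperate"),
        Phage("B", ["immunity repressor"], "Temperate"),
        Phage("C", ["portal protein"], "Temperate"),
        Phage("D", ["portal protein"], "Lytic"),
    ]
    for category, names in cross_check(phages).items():
        print(category, len(names), names)
```

A plain substring match is crude: for example, it would also count an "antirepressor" annotation as a repressor, so the keyword rules would need tightening before the counts are relied on.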

qq9236247 · 2025-01-13T12:05:14Z

Regarding the phages in my analysis that lack an integrase, I found a paper that explains this: https://pmc.ncbi.nlm.nih.gov/articles/PMC6282025

