2400 Mycobacterium phage genomes annotated, only 700 were successful #39

Open
qq9236247 opened this issue Dec 23, 2024 · 9 comments

@qq9236247

I downloaded 2400 mycobacteriophage genomes from https://phagesdb.org/ and annotated them in bulk, but only just over 700 were annotated successfully. I don't know where the problem lies.
Taking the genome linked below as an example, annotation fails whether I run it individually or in bulk.
https://phagesdb.org/media/fastas/Abinghost.fasta

@npbhavya
Collaborator

I can run sphae annotate on the provided fasta file without any errors. Can you confirm that you are running the latest version of sphae, which is v1.4.5? If you are, can you send me the command you ran and the sphae.log generated in the output directory?

Another option is to download and run the sphae Docker container, which would ensure the correct versions are used.

@qq9236247
Author

qq9236247 commented Dec 26, 2024 via email

@npbhavya
Collaborator

That's great; let me know if you run into any other errors.

@qq9236247
Author

qq9236247 commented Jan 1, 2025 via email

@npbhavya
Collaborator

npbhavya commented Jan 2, 2025

Hi,

I have now added a script to this repo: https://github.com/linsalrob/sphae/blob/main/misc/merging_output.py

To run the script, copy the summary.txt files to a new directory and then run the script on that directory. For example:

python misc/merging_output.py <directory with summary files> <output file name>

I look at the columns "Failed during assembly" and "No contigs assigned viral". If either column lists "Yes", the sample failed: either the assembly didn't generate any contigs, or it did generate contigs but they were too short or none of them were assigned as viral.
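
The check described above can be sketched roughly as follows. This is not the code in misc/merging_output.py. It assumes the merged output is a tab-separated table with one row per sample, and the "Sample" column name and the example rows are hypothetical; only the two failure column names come from the description above.

```python
import csv
import io

# Column names taken from the description above.
FAILURE_COLUMNS = ("Failed during assembly", "No contigs assigned viral")


def failed_samples(handle, sample_column="Sample"):
    """Return {sample: [failure columns marked 'Yes']} for samples that failed.

    Assumes a tab-separated table; `sample_column` is a hypothetical name.
    """
    failures = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        reasons = [
            column
            for column in FAILURE_COLUMNS
            if (row.get(column) or "").strip().lower() == "yes"
        ]
        if reasons:
            failures[row[sample_column]] = reasons
    return failures


# Made-up example table, not real sphae output.
example = (
    "Sample\tFailed during assembly\tNo contigs assigned viral\n"
    "phageA\tNo\tNo\n"
    "phageB\tYes\tNo\n"
    "phageC\tNo\tYes\n"
)

for sample, reasons in failed_samples(io.StringIO(example)).items():
    print(sample, "failed:", ", ".join(reasons))
```

For a real file, open it with `open(path, newline="")` and pass the handle in place of the `io.StringIO` object.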

Hope this helps. Let me know if you run into any errors with this script or if you need other features added to it.

@qq9236247
Author

qq9236247 commented Jan 2, 2025 via email

@npbhavya
Collaborator

npbhavya commented Jan 7, 2025

That sounds good, and it sounds like a useful script. Can you commit it to a miscellaneous directory in the repo? I will take a look and add it to the main repo accordingly.

Regarding the lifestyle, we currently don't use repressor genes; we use the presence of an integrase as the indicator of a lysogenic lifestyle. Do you have a reference for using repressor genes to indicate this? It would be good to add that in as well.
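
As a rough illustration of the integrase rule only (not sphae's actual implementation), the idea can be sketched like this. The function name, the plain substring match, and the example product names are all assumptions made for the sketch.

```python
def has_integrase(gene_products):
    """Return True if any annotated product name mentions an integrase.

    Hypothetical helper: a simple case-insensitive substring match,
    used here only to illustrate the rule described above.
    """
    return any("integrase" in product.lower() for product in gene_products)


# Made-up annotation lists for two genomes.
genome_a = ["major capsid protein", "tyrosine integrase", "terminase large subunit"]
genome_b = ["major capsid protein", "tail fiber protein", "endolysin"]

for name, products in (("genome_a", genome_a), ("genome_b", genome_b)):
    label = "lysogenic indicator found" if has_integrase(products) else "no integrase found"
    print(name, "->", label)
```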

Regarding the 5 genomes that didn't assemble: because the raw data is really large, you can subsample it with a tool such as rasusa (https://github.com/mbhall88/rasusa). I am happy to help as well if you can send a link to the data to [email protected].
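
To give a feel for how much a large read set would shrink, here is a small sketch of the arithmetic behind coverage-based subsampling (target bases = genome size × target coverage). It is not rasusa's code, and the genome size, coverage, and read totals below are made-up numbers.

```python
def subsample_fraction(total_bases, genome_size, target_coverage):
    """Fraction of the sequenced bases to keep to reach a target coverage.

    Illustrative arithmetic only; all inputs are supplied by the caller.
    """
    if total_bases <= 0 or genome_size <= 0 or target_coverage <= 0:
        raise ValueError("all inputs must be positive")
    target_bases = genome_size * target_coverage
    # Never ask for more data than is available.
    return min(1.0, target_bases / total_bases)


# Hypothetical run: 5 Gbp of reads, a 50 kb genome, 100x target coverage.
fraction = subsample_fraction(
    total_bases=5_000_000_000,
    genome_size=50_000,
    target_coverage=100,
)
print(f"keep about {fraction:.4%} of the bases")
```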

@qq9236247
Author

qq9236247 commented Jan 12, 2025 via email

@qq9236247
Author

qq9236247 commented Jan 13, 2025 via email
