ValueError: Could Not Determine Alphabet of File When Using digital=True in esl.SequenceFile #80

Closed
BioGavin opened this issue Oct 26, 2024 · 2 comments
Labels
question Further information is requested

Comments

@BioGavin

Hi, authors.
I’m encountering an issue when trying to read a file using esl.SequenceFile with the digital=True parameter. Here is my test code:

import pyhmmer.easel as esl
in_fasta_path = "test.fa"
sequences = esl.SequenceFile(in_fasta_path, digital=True)
for sequence in sequences:
    print(f"Name: {sequence.name.decode('utf-8')}")
    print(sequence.sequence)

The test.fa file contains the following sequence in FASTA format:

>bgc:465365|cds:8530054|hsp:8934241|18-46
TYYGNGVSCDDKKCTVDWGKAWSCGADR

When I set digital=True, I get the following error:

Traceback (most recent call last):
  File "/home/gavin/bigslice-cj/debug/read_fa.py", line 7, in <module>
    sequences = esl.SequenceFile(in_fasta_path, digital=True)
  File "pyhmmer/easel.pyx", line 6289, in pyhmmer.easel.SequenceFile.__init__
  File "pyhmmer/easel.pyx", line 6283, in pyhmmer.easel.SequenceFile.__init__
ValueError: Could not determine alphabet of file: 'test.fa'

If I don't set digital, the script runs successfully and prints the following:

/home/gavin/miniconda3/envs/bigslice/bin/python /home/gavin/bigslice-cj/debug/read_fa.py 
Name: bgc:465365|cds:8530054|hsp:8934241|18-46
TYYGNGVSCDDKKCTVDWGKAWSCGADR

Process finished with exit code 0

Here is the version information for the pyhmmer installation I used:

Name: pyhmmer
Version: 0.10.15
Summary: Cython bindings and Python interface to HMMER3.
Home-page: https://github.com/althonos/pyhmmer
Author: Martin Larralde
Author-email: [email protected]
License: MIT
Location: /home/gavin/miniconda3/envs/bigslice/lib/python3.8/site-packages
Requires: psutil
Required-by: bigslice

I understand that the digital=True parameter is intended to convert amino acid letters to numeric values in the range 0-19. I have checked my input sequence carefully: every character belongs to the standard protein alphabet. Despite this, I still get "ValueError: Could not determine alphabet of file". This is quite puzzling, and I would appreciate any guidance or insight you could provide.

Thank you for your help!

@althonos
Owner

althonos commented Oct 26, 2024

Hi @BioGavin

This most likely happens because HMMER cannot determine the alphabet of your sequence file, as the file is too short. Since digital=True requires an alphabet to succeed, the parser fails in digital mode but not in text mode.
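
To illustrate why digital mode needs an alphabet, here is a minimal pure-Python sketch. It is not how pyhmmer or Easel implement encoding: `digitize` is a hypothetical helper, and the two symbol strings are written from memory of Easel's amino and DNA alphabets, so treat their exact ordering as an assumption. The point is only that the same letters map to different codes under different alphabets, so a sequence cannot be encoded until the alphabet is known.

```python
# Assumed symbol orderings (canonical residues first, then gap and
# degenerate symbols). Treat these strings as an illustration only.
AMINO_SYMBOLS = "ACDEFGHIKLMNPQRSTVWY-BJZOUX*~"
DNA_SYMBOLS = "ACGT-RYMKSWHBVDN*~"


def digitize(text, symbols):
    """Map each letter of `text` to its index in `symbols` (hypothetical helper)."""
    index = {symbol: code for code, symbol in enumerate(symbols)}
    return [index[letter] for letter in text.upper()]


# The same four letters produce different codes under each alphabet.
print(digitize("GATC", AMINO_SYMBOLS))  # [5, 0, 16, 1]
print(digitize("GATC", DNA_SYMBOLS))    # [2, 0, 3, 1]
```

In text mode no such mapping is needed, which would explain why the same file parses fine without digital=True.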

If you know your sequences are always protein sequences, you can provide the alphabet yourself:

import pyhmmer.easel as esl
in_fasta_path = "test.fa"
alphabet = esl.Alphabet.amino()
sequences = esl.SequenceFile(in_fasta_path, digital=True, alphabet=alphabet)
for sequence in sequences:
    print(f"Name: {sequence.name.decode('utf-8')}")
    print(sequence.sequence)

@BioGavin
Author

Thank you for your response. This solution worked perfectly, and the code now runs successfully.
