Error creating Japanese NLP Pipeline #80

Open
gilliganc opened this issue Sep 8, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@gilliganc

gilliganc commented Sep 8, 2022

Describe the bug
Loading the Pipeline for the Japanese model/language throws a MessagePackSerializationException. This is on .NET 6 on Windows 10.

To Reproduce

  1. Add the Japanese model NuGet package.
  2. Run the following code:
Catalyst.Models.Japanese.Register();
var nlp = await Pipeline.ForAsync(Language.Japanese);

The second line throws the exception shown under Additional context.

Expected behavior
The Pipeline is created without error and can perform NLP on Japanese text.

Additional context

MessagePack.MessagePackSerializationException : Error occurred while reading from the stream.
---- System.NullReferenceException : Object reference not set to an instance of an object.

  Stack Trace: 
MessagePackSerializer.DeserializeAsync[T](Stream stream, MessagePackSerializerOptions options, CancellationToken cancellationToken)
StorableObjectV2`2.LoadAsync(Stream stream)
AveragePerceptronTagger.LoadAsync(Stream stream)
<<Register>b__0_7>d.MoveNext()
--- End of stack trace from previous location ---
ResourceLoader.LoadAsync[T](Assembly assembly, String resourceFile, Func`2 loader)
<<Register>b__0_0>d.MoveNext()
--- End of stack trace from previous location ---
StorableObject`2.LoadDataAsync()
AveragePerceptronTagger.FromStoreAsync(Language language, Int32 version, String tag)
Pipeline.ForAsync(Language language, Boolean sentenceDetector, Boolean tagger)
gilliganc added the bug label Sep 8, 2022
@theolivenbaum
Collaborator

Hi @gilliganc, thanks for reporting this. It is probably because we don't have an AveragePerceptronTagger model for Japanese. I'll investigate how to improve this.

Meanwhile, you can create a tokenizer-only pipeline.
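
A minimal sketch of what such a tokenizer-only pipeline might look like. It assumes the `sentenceDetector` and `tagger` parameters visible in the `Pipeline.ForAsync` signature in the stack trace above can both be set to `false`, and that `Document`, `ProcessSingle`, and `ToJson` behave as shown in the Catalyst README. It has not been verified against the Japanese model, and the sample string is made up for illustration:

```csharp
using System;
using Catalyst;
using Mosaik.Core;

// Register the Japanese model package, as in the original report.
Catalyst.Models.Japanese.Register();

// Assumption: disabling both optional stages skips loading the
// AveragePerceptronTagger (the step that throws) and leaves only the tokenizer.
var nlp = await Pipeline.ForAsync(Language.Japanese, sentenceDetector: false, tagger: false);

// Illustrative input only.
var doc = new Document("これはテストです。", Language.Japanese);
nlp.ProcessSingle(doc);

// Dump the tokenized document for inspection.
Console.WriteLine(doc.ToJson());
```

With the tagger disabled, the pipeline should not assign part-of-speech tags, so this would only help with splitting text into tokens, not with finding nouns.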

@gilliganc
Author

Thanks. I think I need more than the tokenizer. I am porting existing Python code based on spaCy to .NET to see whether I can improve performance and make it easier to integrate. Based on what the person who wrote the original code told me, I need more than the tokenizer. We are trying to detect the keywords and the nouns in Japanese text, and I don't think the tokenizer alone would help, right?

@CodeRabbit957

CodeRabbit957 commented Mar 2, 2024

Is this being worked on? I still have this error. It's definitely the AveragePerceptronTagger (I'm getting NullReferenceException).

Does the tokenizer even work properly?

Is there a reason this spaCy model was ported without it? The Japanese model is pretty much useless right now if I can't get anything to work. How soon can this be fixed?

It looks like spaCy hasn't used averaged perceptron taggers since before version 2.0. It now uses neural networks (matrix multiplication). Are all the Catalyst models based on APTs?

@theolivenbaum
Collaborator

@CodeRabbit957 we haven't updated the tagger because we no longer use it in our own app. In any case, Catalyst would need to incorporate a proper CJK tokenizer such as https://github.com/leungwensen/cjk-tokenizer to handle Japanese correctly. If you're up for the challenge, PRs are welcome!
