
OCTIS could not evaluate an external result? #71

Closed
KesselZ opened this issue Sep 3, 2022 · 4 comments

Comments


KesselZ commented Sep 3, 2022

  • OCTIS version: 1.10.4
  • Python version: 3.9
  • Operating System: Windows 11

Description

I got an error:
unable to interpret topic as either a list of tokens or a list of ids.

What I Did

I used another method to extract topics from 20 Newsgroups, and I want to use the metrics provided by OCTIS to evaluate their quality.

So I have many lists of topic words. For example, one list is: ['cheek', 'yep', 'huh', 'ken', 'lets', 'ignore', 'forget', 'art', 'dilemma', 'dilemna']. I need to calculate the topic coherence between these topics and the documents (the corpus).

As a topic modeling evaluation framework, I thought OCTIS could do this for me. However, it turned out to be hard.

I got this error because some of the words in my result topics are not in the 20 Newsgroups corpus provided by OCTIS. I got my data from scikit-learn's 20 Newsgroups, so the only explanation I can see is that the scikit-learn and OCTIS versions of the 20 Newsgroups corpus are different.

Therefore, it seems that the only solution is to train on OCTIS's dataset and then use OCTIS's evaluation system to compute topic coherence. Does this mean that OCTIS does not accept external topics?

I'm not sure if there are other solutions for this case. I believe OCTIS should be able to work with external topic modeling methods; I just did not find the way, so please let me know if you have any suggestions.
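A quick way to see which words trigger this error is to diff each topic against the reference corpus vocabulary. This is a minimal, hypothetical sketch: the toy `corpus` below stands in for OCTIS's tokenized 20 Newsgroups documents, which would be much larger.

```python
# Hypothetical sketch: find which topic words are missing from the
# reference corpus vocabulary -- these are the words that make the
# "unable to interpret topic" error appear.
topics = [["cheek", "yep", "huh", "ken", "lets", "ignore",
           "forget", "art", "dilemma", "dilemna"]]

# stand-in for the tokenized reference corpus
corpus = [["cheek", "yep", "huh"], ["ken", "lets", "ignore"]]

# build the vocabulary as the set of all tokens in the corpus
vocab = {word for doc in corpus for word in doc}

for i, topic in enumerate(topics):
    missing = [w for w in topic if w not in vocab]
    if missing:
        print(f"topic {i}: not in vocabulary: {missing}")
```

Words reported as missing can then be dropped or replaced before passing the topics to the metric.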

@KesselZ

KesselZ commented Sep 3, 2022

I just saw that the introduction of OCTIS mentions that it provides 20 Newsgroups, but it says "# Docs: 16309".

However, 20 Newsgroups from scikit-learn has about 18,000 documents. Is this the reason they are not compatible with each other?

@KesselZ

KesselZ commented Sep 3, 2022

An update: BBCNews works properly. The difference is that BBCNews in sklearn also has 2225 documents, matching the description in OCTIS. So is the reason 20 Newsgroups does not work that OCTIS provides the 20 Newsgroups corpus with a different size?

@silviatti
Collaborator

Hello,
20 Newsgroups in OCTIS is different from the other version because we preprocessed it. Among other things, preprocessing removes documents with fewer than a certain number of words; that's why the two document counts do not match. However, you can use OCTIS just for evaluation, without training a new topic model.

For example, if you want to use topic coherence, you can do the following:

# the metric class (import path per the OCTIS package layout)
from octis.evaluation_metrics.coherence_metrics import Coherence

# the list of topics
topics = {"topics": [['cheek', 'yep', 'huh', 'ken', 'lets', 'ignore', 'forget', 'art', 'dilemma', 'dilemna'], ....]}

# this is the list of documents that you want to use as a reference to compute the topic coherence, 
# i.e. in your case, scikit's 20newsgroups 
texts = [['cheek', 'yep'], [ 'yep', 'huh', 'lets'], ....] 

# define the metric and provide texts as input 
npmi = Coherence(texts=texts, topk=10, measure='c_npmi')

# get the score
npmi.score(topics)

Hope it helps!

Silvia
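For readers curious what the metric computes: NPMI averages, over all word pairs in a topic, log(P(w1, w2) / (P(w1) P(w2))) normalized by -log P(w1, w2), with probabilities estimated from co-occurrence counts in `texts`. Below is a minimal self-contained sketch, not OCTIS's implementation (which delegates to gensim and uses a boolean sliding window); this simplified version counts whole-document co-occurrence.

```python
import math
from itertools import combinations

def npmi_coherence(topic, texts, eps=1e-12):
    """Average NPMI over all word pairs in a topic.

    `texts` is a list of tokenized reference documents, playing the
    same role as the `texts` argument of OCTIS's Coherence metric.
    """
    n_docs = len(texts)
    doc_sets = [set(doc) for doc in texts]

    def p(*words):
        # fraction of documents containing all the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic, 2):
        p1, p2, p12 = p(w1), p(w2), p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur -> NPMI of -1
            continue
        pmi = math.log(p12 / (p1 * p2))
        scores.append(pmi / (-math.log(p12) + eps))
    return sum(scores) / len(scores)

texts = [["cheek", "yep"], ["yep", "huh", "lets"], ["cheek", "yep", "huh"]]
print(round(npmi_coherence(["cheek", "yep", "huh"], texts), 3))  # prints -0.087
```

Scores close to 0 mean the word pairs co-occur about as often as chance; this is why topic words absent from the reference corpus are a problem, since their probabilities cannot be estimated at all.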

@KesselZ

KesselZ commented Oct 17, 2022


Thanks for your help!

@KesselZ KesselZ closed this as completed Oct 17, 2022