Use stopword lists to reduce index size and improve search results #250
Comments
This is a good idea. I'll incorporate it into a future release. Thanks!
Resource: NLTK stopwords by language (25 languages). On a per-language basis, my high-level "want" is:
I just started working with Stork this morning. Very, very nice.
To simplify things, I think the first step should be just to let users add a list. Including the NLTK list would be nice, but supplying the whole list myself is the feature you always end up needing. So I would start with that and perhaps add ease-of-use features later.
To keep the search index clean and small and to improve search results, it would be good to be able to remove common words from the index. Examples would be "this", "that", "a", and "and", or, for German indices, "ein", "eine", "weil", "dass", "der", "die", "das".
For most sites this would decrease the size of the index and improve search results. For the search term "and", we would not return "and he goes...", "and Peter...", but something like "Android" or "Andreas".
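To illustrate the effect described above, here is a minimal Python sketch of a toy inverted index with prefix search. It is not Stork's implementation or API; `STOPWORDS`, `build_index`, `prefix_search`, and the sample pages are all hypothetical names and data made up for this example.

```python
import re
from collections import defaultdict

# Hypothetical, user-supplied stopword list (an assumption, not a Stork setting).
STOPWORDS = {"a", "and", "that", "this"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(pages, stopwords):
    """Map each non-stopword token to the set of page titles containing it."""
    index = defaultdict(set)
    for title, text in pages.items():
        for word in tokenize(text):
            if word not in stopwords:
                index[word].add(title)
    return index


def prefix_search(index, query):
    """Return the sorted titles of pages with an indexed word starting with query."""
    q = query.lower()
    hits = set()
    for word, titles in index.items():
        if word.startswith(q):
            hits |= titles
    return sorted(hits)


# Made-up sample pages echoing the example in the text.
pages = {
    "story": "And he goes home",
    "phones": "Android phones and tablets",
    "people": "Andreas and Peter",
}

index = build_index(pages, STOPWORDS)
results = prefix_search(index, "and")
```

With the stopword list applied, the "story" page no longer matches the query "and", while the pages containing "Android" and "Andreas" still do. The index also holds fewer entries, since "and" is never stored.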
This applies not only to words that are common in the language but, if the list is well chosen, to other words too.
E.g., on a site about coffee, the word "coffee" will appear on nearly every page, so it could make sense to remove it from the index, because a search for it would just return the entire site.
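One way to find such site-specific candidates is to count how many pages each word appears on. The Python sketch below does this; it is only an illustration and not something Stork provides. The function names, the `threshold` parameter, and the sample pages are assumptions made up for this example.

```python
import re


def document_frequency(pages):
    """Count, for each word, the number of pages it appears on at least once."""
    df = {}
    for text in pages.values():
        for word in set(re.findall(r"\w+", text.lower())):
            df[word] = df.get(word, 0) + 1
    return df


def stopword_candidates(pages, threshold=0.9):
    """Return the words that appear on at least `threshold` of all pages."""
    df = document_frequency(pages)
    total = len(pages)
    return sorted(w for w, count in df.items() if count / total >= threshold)


# Made-up sample pages for a site about coffee.
pages = {
    "espresso": "Espresso is strong coffee",
    "filter": "Filter coffee takes longer to brew",
    "beans": "Good coffee starts with fresh beans",
}

candidates = stopword_candidates(pages)
```

Here "coffee" is the only word present on every page, so it is the only candidate. The output is a suggestion for a human to review rather than a list to apply blindly, since the right threshold depends on the site.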