Support for Tamil Wiktionary Word Extraction #330
Comments
You downloaded the wrong dump file; you should use "tawiktionary-20230901-pages-articles.xml.bz2" (the file in bold on the download page: https://dumps.wikimedia.org/tawiktionary/20230901/). wiktextract currently supports Chinese, English, and French Wiktionary dump files, and both Chinese and French are WIP. Each Wiktionary edition should use its own extractor code in the "extractor" folder; otherwise the English extractor is used. Our priority for now is to improve the Chinese and French code; maybe in the future we'll support new languages.
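As a small aid for picking the right file, here is a minimal Python sketch that builds the "pages-articles" file name and URL for a given wiki and dump date, and checks whether a local file name is that variant. The helper names (`articles_dump_name`, `articles_dump_url`, `is_articles_dump`) are hypothetical and not part of wiktextract; the URL layout is assumed from the download page linked above.

```python
import re

# Assumption: dumps live under <host>/<wiki>/<date>/<file>, as on the
# download page linked in the comment above.
DUMP_HOST = "https://dumps.wikimedia.org"


def articles_dump_name(wiki: str, date: str) -> str:
    """File name of the current-revision article dump, e.g. for 'tawiktionary'."""
    return f"{wiki}-{date}-pages-articles.xml.bz2"


def articles_dump_url(wiki: str, date: str) -> str:
    """Full download URL for the pages-articles dump (hypothetical helper)."""
    return f"{DUMP_HOST}/{wiki}/{date}/{articles_dump_name(wiki, date)}"


def is_articles_dump(filename: str) -> bool:
    """True only for '<wiki>-<date|latest>-pages-articles.xml.bz2'.

    Other variants such as 'pages-meta-history' are rejected.
    """
    pattern = r"[a-z_]+-(\d{8}|latest)-pages-articles\.xml\.bz2"
    return re.fullmatch(pattern, filename) is not None


if __name__ == "__main__":
    print(articles_dump_url("tawiktionary", "20230901"))
    print(is_articles_dump("tawiktionary-latest-pages-meta-history.xml.bz2"))
```

The check is deliberately strict: it accepts only the plain pages-articles archive, so the full-history file used in the original command is flagged before a long extraction run starts.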
I tried the same command on the dump file you pointed me to, and I got similar results: a lot of the same error messages and only a very small list of words.
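To put a number on "a very small list of words", one could count what actually landed in the output file. The sketch below assumes the file passed to `--out` contains one JSON object per line and that each object carries `word` and `lang_code` keys; both are assumptions about the output format, and `summarize_output` is a hypothetical helper, not a wiktextract function.

```python
import json
from collections import Counter


def summarize_output(path: str):
    """Return (entry_count, distinct_word_count, entries_per_lang_code).

    Assumes one JSON object per line; blank lines are skipped.
    """
    entries = 0
    words = set()
    by_lang = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            entries += 1
            # "word" and "lang_code" are assumed key names.
            if "word" in entry:
                words.add(entry["word"])
            by_lang[entry.get("lang_code", "?")] += 1
    return entries, len(words), by_lang


if __name__ == "__main__":
    import sys

    total, distinct, per_lang = summarize_output(sys.argv[1])
    print(f"{total} entries, {distinct} distinct words")
    for code, count in per_lang.most_common():
        print(f"  {code}: {count}")
```

Comparing the distinct-word count against the number of pages in the edition gives a rough sense of how much of the dump the extractor skipped.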
Tamil Wiktionary is not supported. At the very least, it would need some subtitle data files in the "data" folder, and you would need to pass the "--dump-file-language-code" option.
Yeah, unfortunately different editions of Wiktionary are so incompatible that you need to do a lot of work to make one work with Wiktextract. All the effort up to now has gone into getting en.wiktionary.org to work, and even that is still incomplete after a few years. At least it's possible to build on the framework and the large body of almost-universal code to write code that can extract stuff from other editions. You'd also need someone who knows the language the edition is written in (in your case Tamil) to figure out what needs to happen when parsing. They would need to extract all the necessary metadata xxyzz was talking about, which goes in the data directory, and then write the code necessary to handle the pages themselves.
When I run wiktwords on this dump: https://dumps.wikimedia.org/tawiktionary/latest/, using the following command:
wiktwords --all --language ta --out tamildata.json tawiktionary-latest-pages-meta-history.xml.bz2
I get a lot of error messages of the form:
"DEBUG: unexpected top-level node: <LEVEL6...".
Only a small fraction of the words end up in the output JSON file. Can you add support for the Tamil Wiktionary (https://ta.wiktionary.org/)?
Thanks!