mcthulhu edited this page Feb 16, 2021 · 22 revisions

Welcome to the Jorkens wiki!

Current issues to work on

  • importing many more dictionary formats, including Yomichan and Kobo dictionaries
  • handling extremely large data sets as a single file is problematic; Jorkens should use a streaming SAX parser for large XML files
  • concordance search results are currently limited to 100, because retrieving and loading very large result sets at once causes memory problems; Jorkens should load results lazily, or use offsets to fetch additional portions of the database results on demand
  • add a tabular edit view for the glossary/dictionary and translation memory databases
  • add an option to export the translation memory
  • allow filtering search results and the library view with tags
  • add a tool bar with icons, and add a lot more keyboard shortcuts
  • need to find a better way to track each book's current reading location, especially when switching books; a database might be more scalable than localStorage
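
The offset idea for concordance results in the list above could look roughly like the following. This is only a sketch in Python with the standard sqlite3 module, not the Jorkens implementation; the table name `tm`, its `source` and `target` columns, and the function `concordance_pages` are hypothetical names chosen for illustration.

```python
import sqlite3

PAGE_SIZE = 100  # same size as the current cap on concordance results


def concordance_pages(conn, term, page_size=PAGE_SIZE):
    """Yield successive pages of (source, target) rows that contain term."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT source, target FROM tm "  # hypothetical table and columns
            "WHERE source LIKE ? ORDER BY rowid LIMIT ? OFFSET ?",
            (f"%{term}%", page_size, offset),
        ).fetchall()
        if not rows:
            return  # no more results
        yield rows
        offset += page_size


if __name__ == "__main__":
    # Build a throwaway in-memory database with 250 matching segments.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tm (source TEXT, target TEXT)")
    conn.executemany(
        "INSERT INTO tm VALUES (?, ?)",
        [(f"la casa {i}", f"the house {i}") for i in range(250)],
    )
    for page in concordance_pages(conn, "casa"):
        print(len(page))  # prints 100, 100, 50
```

Because the function is a generator, only one page is held in memory at a time, and the next query runs only when the caller asks for more. For very deep result sets, filtering on the last rowid seen instead of using OFFSET avoids rescanning the skipped rows.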

Tips

  • The first step after installing Jorkens and TreeTagger, and setting the user's native language if necessary, should probably be to populate the glossary database, so that results can be returned for word look-ups. Shared Anki decks from https://ankiweb.net/shared/decks/ can be used to populate both the glossary and TM (parallel sentence) databases, depending on whether the decks represent vocabulary items or collections of sentences. After importing them into Anki, export them again to a text file; the "Notes in plain text" format seems to work well. This should yield a text file with one pair per line, with the fields separated by tabs. Make sure that the first column is the foreign language. Caveat: shared Anki decks vary in size and quality.

  • Another good source of dictionary data is the Wiktionary-based bilingual dictionaries at http://download.wikdict.com/dictionaries/sqlite/2_2019-02/, which are available for a number of language pairs. Download the file for the appropriate language pair and decompress the .gz archive. The resulting .sqlite3 file can be exported to a .csv file, from which you can extract the columns needed. See https://www.sqlitetutorial.net/sqlite-tutorial/sqlite-export-csv/ for directions on how to do this. (You might also want to remove the accent marks, i.e. the combining acute accent characters, \u0301.)

  • Another source of converted Wiktionary dictionaries is https://github.com/BoboTiG/ebook-reader-dict, which has monolingual dictionaries for Catalan, English, Spanish, French, Portuguese, and Swedish, with instructions on how to create others.

  • https://github.com/facebookresearch/MUSE is another source of dictionaries for numerous languages. Note, however, that these dictionaries contain many technical terms, place names, personal names, etc., which may be unwanted. Jorkens now has an option to import these dictionaries once they have been saved as local files.

  • If you have a glossary or dictionary in an Excel spreadsheet format, copying adjacent columns into a text file (e.g. in Notepad) should also produce a usable tab-delimited text file.

  • https://www.mobileread.com/forums/showthread.php?t=232883 is a very long thread listing many foreign language dictionaries available in Kobo format. An option has been added to import Kobo dictionaries directly, by opening the .zip file.

  • A good source for translation memory (TMX) files to import into Jorkens to support concordance searches is OPUS at http://opus.nlpl.eu/index.php. It has downloadable TMX files for many language pairs from publicly available sources, including UN and EU documents, Global Voices, and movie subtitles. https://farkastranslations.com/bilingual_books.php has a number of classic literary works in multiple languages, in a format that can be converted into TMX files fairly easily.

  • Jorkens can accept TMX files in either language direction. For example, if the user's native language is set to "en", then de-en.tmx and en-de.tmx should both be imported the same way.

  • Until there are built-in options to edit the data, a SQLite editor like DB Browser can be used to edit database records; the database can be found in My Documents\Jorkens\db.

  • Jorkens expects to find Python scripts in My Documents\Jorkens\Python. The Python executable path is currently hard-coded as C:\Python38\python.exe for Windows, and /usr/bin/python3 for Linux. Python scripts can load the currentChapter.txt or bookText.txt file as the data to analyze (using NLTK or other packages).

  • Jorkens sets the current foreign language (which determines the dictionary menus available, etc.) automatically from the language code in the book. If that code is incorrect, as can happen with Calibre conversions, for example, there is an option under the Tools menu to force the current working language to another one. This might also be useful if you want to import a dictionary or TMX file in another language without leaving your current book; you can change the language temporarily and then change it back.

  • A word frequency list generated from the book now takes into account stopwords and lemmatization, so the top word for a Spanish book, for example, is likely to be "ser."

  • Highlights and annotations are now working; see Tools/Mark passage. These highlights should be restored the next time the same book is loaded. Click on a highlight to see the text of an attached note. For the time being, only yellow highlights are enabled. The database entries containing saved passages will later be used as bookmarks.

  • Jorkens now supports additional ebook formats, by converting them to .epub first and then opening them. This requires that Calibre and its ebook-convert.exe tool are installed.

  • An audio player using wavesurfer.js has been added to support listening to audiobooks from local audio files while reading. The file picker accepts .mp3, .wav, and .ogg files.
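
Several of the tips above end with a tab-delimited text file that has one pair per line and the foreign language in the first column. Before importing such a file, it can be worth checking it with a short script. The following is a minimal sketch of that check, not how Jorkens itself imports files; the function name `read_glossary_pairs` and its `strip_accent_marks` flag are made up for this example. The flag removes only the literal \u0301 combining character mentioned in the WikDict tip, so precomposed letters such as "é" are left alone.

```python
import csv
import io

COMBINING_ACUTE = "\u0301"  # the accent mark mentioned in the WikDict tip


def read_glossary_pairs(handle, strip_accent_marks=False):
    """Return (foreign, native) pairs from tab-delimited text, one pair per line."""
    pairs = []
    # QUOTE_NONE treats quotation marks as ordinary characters.
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip blank lines and lines without a tab
        foreign, native = row[0].strip(), row[1].strip()
        if strip_accent_marks:
            foreign = foreign.replace(COMBINING_ACUTE, "")
        if foreign and native:
            pairs.append((foreign, native))
    return pairs


if __name__ == "__main__":
    sample = io.StringIO("casa\thouse\n\nмолоко\u0301\tmilk\n")
    for foreign, native in read_glossary_pairs(sample, strip_accent_marks=True):
        print(foreign, "->", native)
```

To check a real file, open it with `open(path, encoding="utf-8", newline="")` and pass the file object in place of the `io.StringIO` sample.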

Future goals and planned features (this list will probably grow)

  • text segmentation into sentences, and TF/IDF weighting of words and multiword ngrams, to support key word and phrase extraction
  • add a database to track time, reading speed, number of words looked up per page (for example), and any other statistics that could help measure improvement over time, with appropriate graphs
  • show a measure of reading difficulty, at least in terms of vocabulary size and frequency level
  • look into useful Chrome extensions that could be incorporated, especially for Japanese
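
As a rough illustration of the first goal in the list above, sentence segmentation and TF/IDF weighting can be prototyped with the Python standard library alone. The sketch below assumes that each chapter is treated as one document, splits sentences with a naive regular expression (it will break on abbreviations), and handles single words only, not multiword ngrams. The function names are hypothetical and nothing here reflects existing Jorkens code.

```python
import math
import re
from collections import Counter

SENTENCE_END = re.compile(r"(?<=[.!?])\s+")  # split after ., ! or ?
WORD = re.compile(r"\w+")


def split_sentences(text):
    """Naive sentence splitter; does not handle abbreviations."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def tf_idf(documents):
    """Return one {word: weight} dict per document (e.g. per chapter)."""
    tokenized = [WORD.findall(doc.lower()) for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(documents)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty text
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    chapters = ["El gato duerme. El gato come.", "El perro ladra."]
    print(split_sentences(chapters[0]))
    for weights in tf_idf(chapters):
        # Show the three highest-weighted words of each chapter.
        print(sorted(weights, key=weights.get, reverse=True)[:3])
```

Words that occur in every document, such as "el" in the sample, get a weight of zero, which is the property that makes the weighting useful for picking out key words. A real implementation would probably reuse the lemmatization and stopword handling already applied to the word frequency list.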