hrkokw/ryoca

Latest commit

2844a7f · Aug 24, 2024


Ryoca - a minimal e-mail classifier powered by CRM114

Repo contents

  • ryoca ... an e-mail classifier
  • wktng ... a text preprocessor, written in Perl, that splits the non-ASCII parts of UTF-8 text into n-gram-style tokens
  • ryoca-bootcamp ... an ad-hoc Bash script that initializes and updates the statistics using the TOE (train only errors) strategy (read & modify before use!)
  • patches/ ... some small patches I personally apply to CRM114 and normalizemime
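
To illustrate what the TOE strategy means in practice, here is a minimal sketch of such a training loop. It is not the actual ryoca-bootcamp script: the function name `toe_train` and the `RYOCA` override variable are made up for illustration, and treating an Unsure verdict as an error is an assumption. Only the `ryoca` invocations themselves follow the Quick start.

```shell
# Hypothetical TOE sketch; this is NOT the actual ryoca-bootcamp script.
# RYOCA is an assumed override so the classifier command can be swapped out.
RYOCA="${RYOCA:-ryoca}"

# toe_train LABEL FILE...   (LABEL is "good" or "junk")
# Classifies each message and learns it only when the verdict is not the
# expected one; an Unsure verdict is treated as an error here (an assumption).
toe_train() {
    local label expected mail
    label="$1"
    shift
    case "$label" in
        good) expected=Good ;;
        junk) expected=Junk ;;
        *) echo "unknown label: $label" >&2; return 2 ;;
    esac
    for mail in "$@"; do
        # Look only at the header block (everything up to the first blank line).
        if "$RYOCA" < "$mail" | sed '/^$/q' |
            grep -q "^X-CRM114-Status:[[:space:]]*$expected"; then
            continue    # already classified correctly: do not train
        fi
        "$RYOCA" --learn="$label" < "$mail"
        echo "learned as $label: $mail"
    done
}
```

Something like `toe_train junk ~/Maildir/.Junk/cur/*` would then learn only the junk messages that are not yet recognized as junk (the Maildir path is just an example).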

Quick start

Before installing, please read through the ryoca script and adjust the settings inside it.

$ sudo make install
 
$ ryoca --learn=good < GOOD_MAIL_TO_LEARN
  :
$ ryoca --learn=junk < JUNK_MAIL_TO_LEARN
  :
 
$ ryoca < MAIL_TO_CLASSIFY
  => prints original content with `X-CRM114-Status: (Good|Junk|Unsure)' header added
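
If another script needs the verdict rather than the whole message, the added header can be pulled out of the output. The helper below is a sketch; `crm114_verdict` is a made-up name, and it assumes the header value starts with one of the words Good, Junk, or Unsure as shown above.

```shell
# crm114_verdict: read a classified message on stdin and print the first word
# of its X-CRM114-Status header. Only the header block is scanned, so a
# look-alike line in the body is ignored.
crm114_verdict() {
    sed -n -e '/^$/q' \
        -e 's/^X-CRM114-Status:[[:space:]]*\([A-Za-z]*\).*/\1/p' | head -n 1
}

# Usage (assumes ryoca is installed and MAIL_TO_CLASSIFY is a message file):
#   verdict="$(ryoca < MAIL_TO_CLASSIFY | crm114_verdict)"
#   [ "$verdict" = Junk ] && echo "looks like spam"
```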

License

The scripts are licensed under GPLv3, the same as CRM114 itself.

Each patch is under the same license as the software it applies to.

Thanks

FAQ

Why CRM114 in the 2020s?

I had used Bogofilter for over a decade and was reasonably satisfied with it. However, I receive many Japanese emails, and preprocessing (e.g. tokenizing) multibyte text, which Bogofilter cannot do, was a challenge I wanted to take on.

CRM114 was the best fit for implementing such an email classifier without much effort.

Also, patching and installing CRM114 (and normalizemime) was super easy thanks to Portage on my Gentoo Linux server.

Why not official scripts?

One of Ryoca's design goals is to be a simple, minimized version of the official mail(filter|reaver).crm scripts. Those seem too large for me to read through and trust.

Yes, to some extent, I'm reinventing the wheel.

Yet Ryoca is a pure CRM114 script and uses CRM114's strong classifier in a (hopefully) proper way, so I believe its core functionality can be relied on. It might also serve as a good small sample for those who want to write their own CRM114 scripts.

How do you use it?

I use Dovecot, calling Ryoca via:

  • a Sieve script, to add the classification header to incoming emails and filter out spam
  • an IMAPSieve script, to re-learn when I manually move emails into (or out of) a specific IMAP folder

Here are some hints:

Please keep in mind that:

  • sieve_<extension>_input_eol must be set to lf, because Ryoca doesn't support emails with CRLF line endings
  • vsz_limit for the imap service (256M by default) might be insufficient for invoking CRM114 via IMAPSieve, especially with the data window size expanded by the patch
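
When Ryoca is called from somewhere that cannot be told to send LF line endings, the same effect can be approximated with a small filter in front of it. This is only a sketch: `crlf_to_lf` is a hypothetical helper that is not part of this repo, and whether the caller accepts LF-only output back is something to check for your own setup.

```shell
# crlf_to_lf: strip one trailing CR from every line on stdin, leaving any
# CR in the middle of a line untouched.
crlf_to_lf() {
    local cr
    cr="$(printf '\r')"    # a literal carriage return, portable across sed variants
    sed "s/${cr}\$//"
}

# Usage (assumes ryoca is installed):
#   crlf_to_lf < MAIL_WITH_CRLF | ryoca
```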

Where did the name come from?

I watched the movie Library Wars (図書館戦争, Toshokan Sensō) just after starting development. It features a fictional organization called the Media Betterment Committee (メディア良化委員会, Media Ryōka Iinkai) and its military arm, the Media Betterment Force (メディア良化隊, Media Ryōka Tai).

They burn books, relentlessly.

Disclaimer: I'm definitely against censorship and burning books, although I take pleasure in burning the spam sent to me.
