DeepSpam milter v2 development
Project history:
- deepspam1 tf2-keras model ported to pytorch
- new model using SentencePiece and embedding from pretrained XLM
- new metrics for model eval, optimized for spam filtering
- dataset found to be wrong, requires a full review and cleanup :(
- old mailer3 unable to handle dataset properly (utf8, bad html etc)
- mailer3 ported to python3 -> mailer4 initial version (same ui/keys)
- wide unicode display issues -> widechars module implemented (based on wcwidth)
- old html2txt used in deepspam1 found to be wrong... new html parser developed!
- python's mime email parser found to be sloooow and sometimes broken -> implemented my own
- mailer4: integrated deepspam model evaluation, see Screenshot-mailer4-diffmode.png
- mailer4: added search, selection, deduplication, tagging features - tested on real data
Files:
- torch_emb.py: direct port of deepspam1's model training code to pytorch
- torch_spm3.py: new model trainer, uses the python model class from the model/ dir
- maildedup3.py: email deduplication and parsing, from mbox to txt
- mailer4.py: Python3 version of my old email reader, used primarily for spam-dataset preparation & model eval
- eml2str.py: my mime email parser and html2txt converter functions
- hdrdecode.py: my email header parser functions, from the pymavis/spamwall project
- ttykeymap.py: python version of my old getch2.c, console i/o functions and text UI
- widechars.py: wide unicode utilities, based on https://github.com/jquast/wcwidth
- striprtf.py: rtf to txt converter, from https://github.com/joshy/striprtf
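
The wide unicode display problem mentioned above (CJK text taking two terminal columns, combining marks taking none) can be illustrated with a minimal sketch. This is NOT the code from widechars.py or wcwidth; it is a rough approximation using only Python's stdlib unicodedata module, and the names display_width and pad_to_width are hypothetical, made up for this example.

```python
import unicodedata

def display_width(text: str) -> int:
    """Approximate number of terminal columns needed to print text.

    Hypothetical helper, not the widechars.py API. Real wcwidth tables
    handle more special cases (emoji sequences, terminal quirks).
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn over the previous char
        if unicodedata.category(ch) in ("Cc", "Cf"):
            continue  # assumption: control/format chars take no column
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth chars take two columns
        else:
            width += 1
    return width

def pad_to_width(text: str, columns: int) -> str:
    """Right-pad text with spaces so it fills the given number of columns."""
    return text + " " * max(0, columns - display_width(text))
```

Padding by display width instead of len() is what keeps columns aligned in a text UI when subjects or sender names contain wide characters, e.g. pad_to_width(subject, 40) for a message list row.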